ABSTRACT

It is common practice for data scientists to acquire and
integrate disparate data sources to achieve higher quality
results. But even with a perfectly cleaned and merged data
set, two fundamental questions remain: (1) is the integrated
data set complete and (2) what is the impact of any unknown
(i.e., unobserved) data on query results?

In this work, we develop and analyze techniques to estimate
the impact of the unknown data (a.k.a. unknown unknowns)
on simple aggregate queries. The key idea is that
the overlap between different data sources enables us to
estimate the number and values of the missing data items.
Our main techniques are parameter-free and do not assume
prior knowledge about the distribution. Through a series
of experiments, we show that estimating the impact of
unknown unknowns is invaluable to better assess the results of
aggregate queries over integrated data sources.

1. INTRODUCTION 

In the past few years, the number of data sources has
increased exponentially because of the ease of publishing data
on the web, the proliferation of data-sharing platforms (e.g.,
Google Fusion Tables [19] or Freebase [15]), and the adoption
of open data access policies, both in science and government.
The success of crowdsourcing [lT}[l3] 30, 29, 38][2}|51,
[To] provides another virtually unlimited source of information.
This deluge of data has enabled data scientists, both
in commercial enterprises and in academia, to acquire and
integrate data from multiple data sources, achieving higher
quality results than ever before. It is therefore not surprising
that industry and academia alike have developed highly
sophisticated systems and tools to assist data scientists in
the process of data integration [28]. However, even with a
perfectly cleaned and integrated data set, two fundamental
questions remain: (1) do the data sources cover the complete
data set of interest and (2) what is the impact of any
unknown (i.e., unobserved) data on query results?

1.1 Unknown Data 

In this work, we develop techniques to estimate the impact 
of the unknown data on aggregate queries of the form SELECT 
AGGREGATE(attr) FROM table WHERE predicate. 

We assume a simple data integration scenario, as depicted
in Figure 1. Several domain-related data sources are integrated
into one database, preserving the lineage information
for each data item or record. Naturally, these data sources
overlap with each other, but even when put together they
might not be complete. For example, all data sources in
Figure 1 might list U.S. tech companies, but some smaller
companies might not be mentioned in any of the sources.
This data integration scenario applies to a wide range of
use cases, ranging from crowdsourcing (where every crowd-worker
can be considered a single data source [13]) to data
extraction from web pages.

Figure 1: Simple data integration scenario where multiple
data sources overlap but are not necessarily complete.

Estimating the impact of the unknown data (data items
that are not observed in any data source) is particularly
difficult as we know neither how many unique data items
are missing nor their values; thus, we deal with unknown
unknowns. This characteristic distinguishes our work from
what is generally known as missing data, or known unknowns,
estimation in Statistics [1, 43, 40], which tries to estimate the
value of unknown (missing) attributes for known records. At
first glance, it may seem impossible to estimate the impact
of unknown unknowns; however, for a large class of data
integration scenarios, the analysis of the overlap of multiple data
sources makes it feasible.

1.2 A Running Example 

To demonstrate the impact of unknown unknowns, we
pose a simple aggregate query to calculate the number of all
employees in the U.S. tech industry, SELECT SUM(employees)
FROM us_tech_companies, over a crowdsourced data set. We
used techniques from [13] to design the crowdsourcing tasks
on Amazon Mechanical Turk (AMT) to collect employee
numbers from U.S. tech companies.¹ The data was manually
cleaned before processing (e.g., entity resolution, removal of
partial answers). Figure 2 shows the result.

The red line represents the ground truth (i.e., the total
number of employees in the U.S. tech sector) for the query
[39], whereas the grey line shows the result of the observed
SUM query over time with the increasing number of received
crowd-answers. The gap between the observed result and
the ground truth is due to the impact of the unknown
unknowns, which gets smaller at a diminishing rate as more
crowd-answers arrive.

¹ More precisely, we only asked for companies with a presence
in Silicon Valley, as we found it provides more accurate
results (see also Section 6).

Figure 2: Employees in the U.S. tech sector

While the experiment was conducted in the context of
crowdsourcing, the same behavior can be observed with other
types of data sources, such as web pages. For instance, suppose
a user searches the Internet to create a list of all solar
energy companies in the U.S. The first few web pages will
provide the greatest benefit (i.e., more new solar companies),
while after a dozen web pages the benefit of adding
another web page diminishes as the likelihood of duplicates
increases. The rate of increasing overlap of data sources is
indicative of the completeness of the data set.

1.3 A Naive Solution

The same type of diminishing effect is also known as the
Species Accumulation Curve in Ecology [47], where the rate
of new species discovered decreases with increasing cumulative
effort to search. Measuring species richness (i.e., counting
species) is critical in many ecological studies. Plotting a
Species Accumulation Curve provides a way to estimate the
number of additional species to be discovered.

These species estimation techniques lay the foundation for
estimating the impact of unknown unknowns on aggregate
query results. A naive solution for the SUM query from
Section 1.2 would be to first estimate the number of unknown
data items using species estimation techniques [46]
and then use mean substitution to estimate their value [37].
This assumes that the missing items have on average the
same attribute value as the observed (known) data items.

The naive approach has a couple of drawbacks. First, 
species estimation has very strict requirements on how data 
is collected. Almost every data integration scenario violates 
these requirements, causing the estimator to significantly 
over/underestimate the number of missing data items. 

Second, it ignores the fact that the attribute values of the
missing items may be correlated with the likelihood of observing
certain data items. For example, large tech companies
like Google with many employees are often better known
and thus appear more often in data sources than smaller
start-ups, creating a biased data set. This is problematic as
it also biases the mean and with it the estimate.

In the statistics literature, this second problem is referred 
to as Missing Not At Random (MNAR) [37, 43], where the
missingness of a data item depends on its value. There are 
many statistical inference techniques dealing with MNAR [l | 
10, 9, l][52]|40], but nearly all the techniques require at least 
partial knowledge of the record. For example, in the case of 
surveys, people with a high salary might be more reluctant 
to report their salary but have no problem stating their home 
address or how many children they have. Existing MNAR 
techniques use the reported values (e.g., the address) to infer 
the missing attributes. Unfortunately, this is not possible in 
the case of unknown unknowns , as we miss the entire record. 


1.4 Contributions 

This work is a first step towards developing techniques
to estimate the impact of the unknown unknowns on query
results. Our focus is on simple aggregate queries, especially
SUM aggregates, but we also touch upon other aggregations
like COUNT, AVG, MIN, and MAX. We design techniques
that can deal with the peculiarities of the data integration
scenarios discussed before, such as uneven contributions
from different sources (bias of data sources).

In this work, we use crowdsourced data sets because they
are easier to collect, but the techniques are general and apply
to almost all data integration scenarios that combine
overlapping data sources. While we do not argue that the
proposed techniques can predict black-swan-like data items
(i.e., extremely rare data items), we will show that our techniques
can provide useful estimates under more “normal”
circumstances, which we will define more formally. For instance,
in the example of Figure 2 we can get an almost
perfect estimate of the impact of the unknown unknowns after
only 350 crowd-answers. In addition, by building upon
recent work on the Good-Turing estimator [31], we are able
to provide an upper bound for our estimates under
easy-to-understand conditions. In summary, we make the following
contributions:

• We formalize the problem of estimating the impact of
unknown unknowns on query results and describe why
existing techniques for species estimation and missing data
estimation are not sufficient.

• We develop techniques to estimate the impact of the
unknown unknowns on aggregate query results.

• We derive a first upper bound for SUM aggregate queries.

• We examine the effectiveness of our techniques via
experiments using both real and synthetic data sets.

In the following, we first formalize our problem statement
(Section 2), present techniques to estimate the impact of
unknown unknowns for SUM queries (Section 3), and propose
an upper bound estimate (Section 4). Section 5 then extends
these techniques to other aggregate functions, and in
Section 6 we evaluate our techniques, followed by related
work and the conclusion.


2. THE IMPACT OF UNKNOWN UNKNOWNS

In this section, we define unknown unknowns, explain how
data integration over multiple sources can be regarded as a
sampling process, and formally define our estimation goal.
For convenience, Appendix A contains a symbol table.

For the purpose of this work, we treat data cleaning (e.g.,
entity resolution, data fusion, etc.) as an orthogonal problem.
Any data cleaning technique
can be applied to our problem without altering the problem
context. While data quality can influence the estimation
quality, studying it goes beyond the scope of this paper [46].
We assume that after a proper data cleaning process we have
one instance per observed entity and know exactly how many
times the entity was observed across multiple data sources.

2.1 Unknown Unknowns 

We assume that queries are of the form SELECT
AGGREGATE(attr) FROM table WHERE predicate, that table only
contains records about a single entity class (e.g., companies),
and that a record in table corresponds to exactly one
real-world entity (e.g., IBM). Thus, in the remainder of the paper
we use record, entity, and data item interchangeably.


Definition 1. (Unknown Unknowns) Let $\Omega$ be the universe,
of unknown size, of all valid unique entities $r$ for a
given entity class and $attr_A(r)$ be the value of attribute $A$ of
$r$. Then the ground truth $D \subseteq \Omega$ is defined as the set of entities
that satisfy the predicate, i.e., $D = \{r \in \Omega \mid predicate(r)\}$,
where its size $N = |D|$ is not known. Let $S$ be a sample with
replacement from $D$ and $c$ be the number of unique entities
in $S$. Unknown unknowns $U$ refers to any unobserved entity
$r$ that exists in $D$ but not in $S$: $U = D - S$ with size $N - c$.


For our running example, $\Omega$ would be the universe of all
companies in the world, $D$ all tech companies in the U.S., and
$\phi_D = \sum_{r \in D} attr_{empl}(r)$ would be the true number of U.S. tech sector
employees. $S$ would be a sample with duplicates, and unknown
unknowns would be every company which is not in $S$.

What we aim to achieve is a good estimate of the ground
truth: SELECT AGGREGATE(attr) FROM D, when we only have
$S$. Note that we drop the predicate from the query, since
every item in $D$ already has to fulfill the predicate. In this
work, we assume that we know neither all entities in $D$ nor
its size (i.e., open world assumption). This distinguishes our
problem from the problem of missing data [37, 43, 44], which
refers to incomplete data or missing attribute values.


2.2 Data Integration As Sampling Process 

Data integration refers to the process of combining different
data sources under a common schema [12]. For the
purpose of this work, we assume that data sources are
independent samples (e.g., data sources are not copies of each
other and instead are independently created), and we model
the data integration process as a multi-stage sampling process,
as shown in Figure 3.

We assume $l$ data sources $s_1, \ldots, s_l$, each sampling $n_j = |s_j|$
data items from the ground truth $D$ (e.g., the complete set
of tech companies in the U.S. with their respective number of
employees), without replacement, as a data source typically
only mentions a data item once. We assume further
that every data item $d_i \in D$ has a publicity likelihood $p_i$
of being sampled, following some distribution $X$. Likewise,
the attribute values (e.g., the number of employees) have
a certain likelihood to appear in the ground truth, referred
to as value likelihood, again following some distribution $Y$.
These two distributions are possibly correlated, making the
publicity-value correlation bigger or smaller than 0: $\rho \neq 0$.


The $l$ data sources are then integrated into a single
integrated data set $S$ of size $n_S = \sum_{j=1}^{l} n_j$. Although each
source samples without replacement from $N = |D|$ different
classes (i.e., unique data items), $S$ contains duplicates
because every data source is sampling from the same underlying
truth $D$. If $l$ is sufficiently large, $S$ approximates
a sample with replacement from $D$, which is the reason
why species estimation techniques work in the first place (we
analyze the effects of smaller $l$ in Sections 3.4 and 6). The
number of unique data items $c$ in $S$ is likely to be smaller
than $N$. In contrast, the end-user only sees a view of $S$,
referred to as the integrated database $K$ (for Known data),
which contains only one entity per unique entity in $S$.
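The sampling model above can be made concrete with a small simulation. The following is only a sketch under stated assumptions: the publicity values (`1/(i+1)`), the number of sources, and the helper names `sample_source` and `integrate` are hypothetical and chosen for illustration. Each simulated source draws distinct items weighted by publicity, and the integrated set keeps per-item observation counts, from which the sample size and the number of unique items follow.

```python
import random
from collections import Counter

def sample_source(publicity, n_j, rng):
    """Draw n_j distinct item ids, weighted by publicity, without replacement."""
    remaining = dict(publicity)
    chosen = []
    for _ in range(min(n_j, len(remaining))):
        ids = list(remaining)
        pick = rng.choices(ids, weights=[remaining[i] for i in ids], k=1)[0]
        chosen.append(pick)
        del remaining[pick]          # a source mentions an item at most once
    return chosen

def integrate(sources):
    """Merge all sources into S and count how often each item was observed."""
    counts = Counter(item for src in sources for item in src)
    n = sum(counts.values())         # size of S, duplicates included
    c = len(counts)                  # unique items, i.e. the size of K
    return counts, n, c

rng = random.Random(0)
N = 50                                              # ground-truth size (assumed)
publicity = {i: 1.0 / (i + 1) for i in range(N)}    # hypothetical skewed publicity
sources = [sample_source(publicity, 10, rng) for _ in range(8)]
counts, n, c = integrate(sources)
print(f"n = {n}, c = {c}, unobserved = {N - c}")
```

The `counts` mapping is the lineage summary that the estimators in Section 3 operate on: one entry per entity in $K$, with the number of sources that mentioned it.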

This data integration model covers a large class of use
cases from web integration to crowdsourcing. In the latter
case, each crowd worker can be regarded as a separate data
source $s_j$, as it is known that workers also sample without
replacement from $D$ [46]. While extremely powerful, there are
scenarios where this sampling model does not apply. Most
importantly, data sources are not always independent [24].
Furthermore, the number of data sources $l$ has to be large
enough to have sufficient overlap between the sources (see
Section 6). If any of these assumptions are violated, then
only low-quality estimations are possible.


2.3 Problem Statement 

We are interested in estimating the impact of unknown 
unknowns (U) to adjust aggregate query results. 


Definition 2. (The Impact of Unknown Unknowns) Given
an integrated database $K$, the impact of unknown unknowns
is defined as the difference between the current answer $\phi_K$
of the aggregate query over the database $K$ and the answer
over the ground truth $\phi_D$:

$$\Delta = \phi_D - \phi_K$$ (1)


Our goal is to estimate the answer on the ground truth
by estimating $\Delta$ based on $S$:

$$\hat{\phi}_D = \phi_K + \hat{\Delta}(S)$$ (2)

Note that this definition works for all common aggregates,
including MIN and MAX, where $\Delta$ would be the positive or
negative adjustment to the observed MIN/MAX value.

3. SUM QUERY 

In this section, we focus on SUM aggregates to illustrate
our estimation techniques. We first formalize the naive
estimator (Section 3.1), which was informally introduced in the
introduction. We then develop the frequency estimator by
making the naive estimator more robust to the publicity-value
correlation (Section 3.2). Afterwards, we describe the more
sophisticated bucket estimator (Section 3.3). Finally, we
develop a Monte-Carlo estimator, which is better suited for a
smaller set of data sources (Section 3.4).

3.1 Naive Estimator 

Estimating the impact of unknown unknowns for SUM
queries is equivalent to solving two sub-problems: (1)
estimating how many unique data items are missing (i.e., the
unknown unknowns count estimate), and (2) estimating the
attribute values of the missing data items (i.e., the
unknown unknowns value estimate). The naive estimator uses
the Chao92 [T] species estimation technique to estimate the
number of the missing data items, and mean substitution
[37] to estimate their values.

Let $\phi_K = \sum_{r \in K} attr(r)$ be the current sum over the
integrated database, then we can more formally define our naive
estimator for the impact of unknown unknowns as:

$$\Delta_{naive} = \underbrace{\frac{\phi_K}{c}}_{\text{Value estimate}} \cdot \underbrace{(\hat{N} - c)}_{\text{Count estimate}}$$ (3)

$\hat{N}$ is the estimate of the number of unique data items in the
ground truth $D$, and $c$ is the number of unique entities in
our integrated database $K$ (thus, $\hat{N} - c$ is our estimate of the
number of the unknown data items). $\phi_K/c$ is the average
attribute value of all unique entities in our database $K$.


3.1.1 Chao92 estimator 

Throughout the paper, we use the popular Chao92
estimator. Many species estimation techniques exist [5, 6],
but we choose Chao92 since it is more robust to a skewed
publicity distribution. The Chao92 estimator uses sample
coverage to predict $N$. The sample coverage $C$ is defined
as the sum of the probabilities $p_i$ of the observed classes.
Since the true distribution $p_1, \ldots, p_N$ is unknown, we estimate
$C$ using the Good-Turing estimator [14]:

$$\hat{C} = 1 - f_1/n$$ (4)

The $f$-statistics, e.g., $f_1$, represent the frequencies of
observed data items in the sample, where $f_j$ is the number of
data items with exactly $j$ occurrences in the sample. $f_1$ is
referred to as singletons, $f_2$ as doubletons, and $f_0$ as the missing
data [2]. The sample coverage estimate is based on the ratio between the
number of singletons ($f_1$) and the sample size ($n$). This ratio
changes with the number of duplicates in the sample.
The high-level idea is that the more duplicates exist in our
sample $S$ compared to the number of singletons $f_1$, the more
complete the sample is (i.e., higher sample coverage).
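As a small illustration of the $f$-statistics and Equation 4, the sketch below computes both from a mapping of entity to observation count. The function names and the toy counts are assumptions made for this example only.

```python
from collections import Counter

def f_statistics(counts):
    """Map j -> f_j, the number of items observed exactly j times."""
    return Counter(counts.values())

def sample_coverage(counts):
    """Good-Turing coverage estimate: C = 1 - f1 / n."""
    n = sum(counts.values())             # sample size, duplicates included
    f1 = f_statistics(counts)[1]         # number of singletons
    return 1.0 - f1 / n

# Toy sample: entity -> number of sources that mentioned it (hypothetical).
counts = {"A": 3, "B": 2, "C": 1, "D": 1, "E": 3}
f = f_statistics(counts)
print(dict(f), sample_coverage(counts))
```

Here two of the ten observations are singletons, so the estimated coverage is 0.8: roughly 20% of the probability mass is attributed to items that have not been seen yet.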

In addition, the Chao92 estimator explicitly incorporates
the skewness of the underlying distribution using the coefficient
of variation (CV) $\gamma$, a metric that is used to describe the
dispersion in a probability distribution |Y|. A higher CV
indicates a higher variability among the $p_i$ values, while a
CV of 0 indicates that each item is equally likely (i.e., the
items follow a uniform distribution).

Given the publicities $(p_1, \ldots, p_N)$ that describe the
probability of the $i$-th class being sampled from $D$, with mean
$\bar{p} = \sum_i p_i / N = 1/N$, CV can be expressed as follows:

$$\gamma = \left[ \sum_i (p_i - \bar{p})^2 / N \right]^{1/2} \Big/ \; \bar{p}$$ (5)


However, since $p_i$ is not available for all data items, CV has
to be estimated using the $f$-statistics:

$$\hat{\gamma}^2 = \max \left\{ \frac{c}{\hat{C}} \cdot \frac{\sum_i i(i-1) f_i}{n(n-1)} - 1, \; 0 \right\}$$ (6)


The final Chao92 estimator $\hat{N}_{Chao92}$ can then be
formulated as:

$$\hat{N}_{Chao92} = \frac{c}{\hat{C}} + \frac{n(1 - \hat{C})}{\hat{C}} \cdot \hat{\gamma}^2$$ (7)
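Equations 4, 6, and 7 can be combined into a few lines of code. This is a minimal sketch, assuming the input is a mapping from entity to observation count; the function name `chao92` and the handling of the degenerate cases (empty sample, singletons only) are choices made for this example, not something the text prescribes.

```python
from collections import Counter

def chao92(counts):
    """Chao92 estimate of the number of unique items N (Equations 4, 6, 7)."""
    n = sum(counts.values())                 # sample size
    c = len(counts)                          # observed unique items
    f = Counter(counts.values())             # f[j] = items seen exactly j times
    f1 = f[1]
    if n == 0:
        return 0.0
    if n == f1:
        # Only singletons: coverage is 0 and the estimate diverges.
        return float("inf")
    coverage = 1.0 - f1 / n                  # Equation 4
    pair_sum = sum(j * (j - 1) * fj for j, fj in f.items())
    gamma2 = max((c / coverage) * pair_sum / (n * (n - 1)) - 1.0, 0.0)  # Eq. 6
    return c / coverage + n * (1.0 - coverage) / coverage * gamma2      # Eq. 7

counts = {"A": 3, "B": 2, "C": 1, "D": 1, "E": 3}
print(chao92(counts))
```

For the toy counts above, the estimated CV is clipped to 0, so the estimate reduces to $c/\hat{C}$ = 5 / 0.8.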


3.1.2 The Estimator 

$\hat{N}_{Chao92}$ is our estimate for $N$, and comparing this to $c$
provides us with a means of evaluating the completeness of
$S$. By substituting $\hat{N}_{Chao92}$ for $\hat{N}$, the final naive estimator
can be written as:

$$\Delta_{naive} = \frac{\phi_K}{c} \cdot (\hat{N}_{Chao92} - c) = \frac{\phi_K \cdot f_1 \cdot (c + \hat{\gamma}^2 n)}{c \cdot (n - f_1)}$$ (8)

Note that the naive estimator does not consider any
publicity-value correlation and thus tends to over- or underestimate
the ground truth.
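The closed form of the naive estimator follows from the Chao92 estimate (Equation 7) once the coverage estimate of Equation 4 is written as $\hat{C} = (n - f_1)/n$. The intermediate steps, spelled out here as a sketch for readers who want to check the algebra:

```latex
\begin{align*}
\hat{N}_{Chao92}
  &= \frac{c}{\hat{C}} + \frac{n(1-\hat{C})}{\hat{C}}\,\hat{\gamma}^2
   = \frac{n c}{n - f_1} + \frac{n f_1}{n - f_1}\,\hat{\gamma}^2, \\
\hat{N}_{Chao92} - c
  &= \frac{n c + n f_1 \hat{\gamma}^2 - c\,(n - f_1)}{n - f_1}
   = \frac{f_1\,(c + \hat{\gamma}^2 n)}{n - f_1}.
\end{align*}
```

Multiplying the last expression by the value estimate $\phi_K / c$ gives the closed form of $\Delta_{naive}$. The denominator $n - f_1$ also shows why the estimate diverges when a sample consists of singletons only.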


3.2 Frequency Estimator 

We developed a simple variation of the naive estimator,
which makes direct use of the frequency statistics to improve
estimation quality. All coverage-based species estimation
methods give special attention to the singletons $f_1$, the data
items observed exactly once. The idea is that those items,
in relation to the sample size $n$, give a clue about how well
the complete population is covered. A ratio of $f_1/n$ close to
1 means that almost every sample is unique, indicating that
many items might still be missing. Conversely, a ratio close
to 0 indicates all unique values have been observed several
times, decreasing the likelihood of any unknown data. We
use a similar reasoning to improve our value estimation. The
key idea is that singletons are the best indicator of missing
data items, and that their average value might be a better
representation of the values of the missing items. Let $\phi_{f_1}$ be
the sum of all singletons, $\sum_{r \in singletons} attr(r)$, and $\hat{N}_{Chao92}$
again be the Chao92 count estimate. Then the estimator
can be defined as:

$$\Delta_{freq} = \frac{\phi_{f_1}}{f_1} \cdot (\hat{N}_{Chao92} - c) = \frac{\phi_{f_1} \cdot (c + \hat{\gamma}^2 n)}{n - f_1}$$ (9)

While this estimator still does not directly consider the
publicity-value correlation, it is more robust against popular
high-impact data items (i.e., data items with extreme
attribute values). For example, in our running employee
example, big companies that are highly visible like Google
or IBM can significantly impact the known value estimate
$\phi_K/c$. However, when using the average value of the
singletons, $\phi_{f_1}/f_1$, it is reasonable to assume that those companies
will not stay singletons very long in any sample and
thus will not impact the average value for the unknown
unknowns. This estimator is surprisingly simple and becomes
even simpler if we assume $\hat{\gamma}^2 = 0$:

$$\Delta_{freq} = \frac{\phi_{f_1} \cdot c}{n - f_1}$$ (10)

Note that $\hat{\gamma}^2 = 0$ makes it a Good-Turing estimate, which
also converges to the ground truth even for skewed publicity
values; it might just take a bit longer [T]. While $\Delta_{freq}$ is
not the best estimator (see Section 6), its simplicity makes it
still useful to quickly test whether an aggregate query result might
be impacted by any unknown unknowns.
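The difference between the naive and the frequency estimator is easiest to see on a toy data set in which one highly visible entity dominates the observed mean. The sketch below is illustrative only: the company names, values, and counts are invented, and `chao92`, `delta_naive`, and `delta_freq` are hypothetical helper names implementing Equations 7, 8, and 9.

```python
from collections import Counter

def chao92(counts):
    """Chao92 estimate of N from entity -> observation count (Equation 7)."""
    n, c = sum(counts.values()), len(counts)
    f = Counter(counts.values())
    if n == 0:
        return 0.0
    if n == f[1]:
        return float("inf")                  # singletons only: no coverage
    cov = 1.0 - f[1] / n
    pair_sum = sum(j * (j - 1) * fj for j, fj in f.items())
    gamma2 = max((c / cov) * pair_sum / (n * (n - 1)) - 1.0, 0.0)
    return c / cov + n * (1.0 - cov) / cov * gamma2

def delta_naive(counts, values):
    """Mean of all observed values times the estimated number of missing items."""
    c = len(counts)
    phi_k = sum(values[e] for e in counts)
    return phi_k / c * (chao92(counts) - c)

def delta_freq(counts, values):
    """Mean of the singleton values times the estimated number of missing items."""
    singles = [e for e, k in counts.items() if k == 1]
    if not singles:
        return 0.0                           # f1 = 0 implies no missing items
    phi_f1 = sum(values[e] for e in singles)
    return phi_f1 / len(singles) * (chao92(counts) - len(counts))

# Hypothetical data: a famous large company and two rarely mentioned start-ups.
counts = {"BigCo": 5, "MidCo": 2, "StartupA": 1, "StartupB": 1}
values = {"BigCo": 50000, "MidCo": 1200, "StartupA": 40, "StartupB": 60}
print(delta_naive(counts, values), delta_freq(counts, values))
```

Both estimators share the same count estimate; only the value estimate differs. Because the singletons here are small companies, the frequency estimator attributes far fewer employees to the unobserved entities than the naive one.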

3.3 Bucket Estimator 


The problem with the previous two estimators is that they
do not directly consider a correlation between publicity and
attribute values. We designed the bucket estimator as a first
estimator for unknown unknowns with publicity-value
correlation. The idea of the estimator is to divide the attribute
value range into smaller sub-ranges called buckets, and treat
each bucket as a separate data set. We can then estimate
the impact of unknown unknowns per bucket (e.g., large,
medium, or small companies) and aggregate them to the
overall effect:

$$\Delta_{bucket} = \sum_i \Delta(b_i)$$ (11)


Here $\Delta(b_i)$ refers to the estimate per bucket, and either the
frequency or the naive estimator could be used. Using buckets
has two effects: First, it provides a more detailed estimate
of what types of companies are missing and, related to that,
second, the value variance per bucket decreases, making the
estimate less prone to outliers (e.g., items with extremely low
and high values can be “contained” in separate buckets).

The challenge with the bucket estimator is to determine
the right size for each bucket. If the bucket size is too small,
the bucket contains almost no data items. In the extreme
case of having a single data item per bucket, no count or
proper value estimation is possible. If the bucket size is too
big, then the publicity-value correlation can still bias the
estimate. In fact, the case with a single bucket is equivalent to
using just the naive or frequency estimator. In the following,
we describe two bucketing strategies.

3.3.1 Static Bucket 

An easy way to define buckets is to divide the observed
value range into a fixed number $n_b$ of buckets of size $w_b$:

$$w_b = \frac{a_{max} - a_{min}}{n_b}$$ (12)


where $a_{min}$ ($a_{max}$) refers to the min (max) observed
attribute value. Afterwards, we apply $\Delta_{naive}$ per bucket. It
is important to note that the estimate goes to infinity for
buckets which only contain singletons due to division by
zero ($n - f_1 = 0$, see Equation 8), which can significantly
increase the error of the estimate for very small buckets.

Unfortunately, the optimal number of buckets varies
depending on the underlying publicity distribution (see
Appendix B). When the publicity distribution is more skewed
and correlated with attribute values, some static buckets may
contain too few data items, whereas others contain more
than enough. The true publicity distribution is not known
and we cannot predetermine the right number (or size) of
static buckets. Consequently, we found that estimation based
on static buckets is of little practical value.
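The static strategy, including its failure mode for buckets that hold only singletons, can be sketched as follows. This is an illustration under assumptions: the helper names and the toy data are invented, the last bucket is treated as closed on the right so that the maximum value falls into it, and empty buckets are simply skipped.

```python
from collections import Counter

def chao92(counts):
    """Chao92 estimate of N from entity -> observation count (Equation 7)."""
    n, c = sum(counts.values()), len(counts)
    f = Counter(counts.values())
    if n == 0:
        return 0.0
    if n == f[1]:
        return float("inf")                  # n - f1 = 0: division by zero
    cov = 1.0 - f[1] / n
    pair_sum = sum(j * (j - 1) * fj for j, fj in f.items())
    gamma2 = max((c / cov) * pair_sum / (n * (n - 1)) - 1.0, 0.0)
    return c / cov + n * (1.0 - cov) / cov * gamma2

def delta_naive(counts, values):
    c = len(counts)
    return sum(values[e] for e in counts) / c * (chao92(counts) - c)

def static_buckets(counts, values, n_b):
    """Split the observed value range into n_b equal-width buckets (Eq. 12)."""
    lo = min(values[e] for e in counts)
    hi = max(values[e] for e in counts)
    width = (hi - lo) / n_b
    buckets = [dict() for _ in range(n_b)]
    for e, k in counts.items():
        idx = min(int((values[e] - lo) / width), n_b - 1) if width > 0 else 0
        buckets[idx][e] = k
    return [b for b in buckets if b]         # skip empty buckets

def delta_static(counts, values, n_b):
    """Sum of per-bucket naive estimates (Equation 11)."""
    return sum(delta_naive(b, values) for b in static_buckets(counts, values, n_b))

counts = {"BigCo": 5, "MidCo": 2, "StartupA": 1, "StartupB": 1}
values = {"BigCo": 50000, "MidCo": 1200, "StartupA": 40, "StartupB": 60}
for n_b in (1, 2, 100):
    print(n_b, delta_static(counts, values, n_b))
```

With a single bucket the result equals the naive estimator. With two buckets the large company is isolated and the estimate drops sharply. With 100 buckets the two start-ups end up alone in a bucket of singletons, and the estimate becomes infinite, which is the sensitivity to bucket size described above.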

3.3.2 Dynamic Bucket 

To overcome the previously mentioned issues, we developed
several alternative statistical approaches to determine
the optimal bucket boundaries over time. The most notable
are our uses of the error estimate/upper bounds from
Section 4 and of treating $f_1$ as a random variable (see also
Section 3.5). Surprisingly, we achieve the best performance
across all our real-world use cases and simulations using a
rather simple conservative approach, referred to as $\Delta_{Dynamic}$.

The core idea behind our dynamic strategy $\Delta_{Dynamic}$ is
to sort the attribute values of $S$ and then recursively split the
range into smaller buckets only if it minimizes the estimated
impact of unknown unknowns, i.e., the absolute $\Delta$ value. At
first glance, this is controversial, since either under- or
overestimation could be better for different use cases. However,
there is a more fundamental reason behind this strategy.

The Foundation: Whenever we split a data set into
buckets, each bucket contains less data than before the split,
and the chance of an estimation error increases due to the
law of large numbers (i.e., the less data, the higher the
potential variance) [5T, 27]. To illustrate this, we consider the
simplest case of a uniform publicity distribution ($\gamma = 0$) and
an even bucket split. In this case, we can show that the
sum of the Chao92 estimates for $N$ after the split is greater
than or equal to the Chao92 estimate before the split:

$$\underbrace{\hat{N}_{Chao92} = \frac{c}{1 - f_1/n} = \frac{n \cdot c}{n - f_1}}_{\text{Before split}} \le \underbrace{\frac{n_{b_1} \cdot c_{b_1}}{n_{b_1} - f_1^{b_1}} + \frac{n_{b_2} \cdot c_{b_2}}{n_{b_2} - f_1^{b_2}}}_{\text{After split}}$$ (13)

When we split the data exactly into halves, it follows that
$c_{b_1} = c_{b_2} = c/2$ (i.e., we split in regard to the unique
values). With a uniform publicity distribution, every item is
equally likely, and therefore we can assume that both buckets
contain roughly the same amount of data after the split:
$n_{b_1} \approx n_{b_2} \approx n/2$. However, in contrast to $n$ and $c$, the
number of singletons ($f_1$) can vary significantly between the
buckets. In fact, we know that the estimators only stabilize
if every item was observed several times [T] and, as a
consequence, $n$ has to be significantly larger than $c$ and $c$
significantly larger than $f_1$ ($n \gg c \gg f_1$). Therefore, the
variance of $f_1$ is relatively higher than that of $n$ or $c$ between
the buckets, and if we split, there is a higher chance
that we unevenly distribute the $f_1$ among the buckets.

Algorithm 1: Dynamic bucket generation

    Input : S
    Output: List of buckets

     1  b0 = (minValue(S), maxValue(S));   /* init bucket b0 */
     2  todo = [b0];                       /* list with b0 */
     3  δmin = abs(Δ(b0));                 /* Δ estimate over b0 */
     4  bkts = [];                         /* final bucket list */
     5  while !todo.empty do
     6      b = todo.pop;                  /* remove first element */
     7      δtmp = δmin − abs(Δ(b));
     8      tmp = (null, null);            /* empty pair */
     9      for unique r ∈ b do
    10          (t1, t2) = split(b, r.value);
    11          if δmin > δtmp + abs(Δ(t1)) + abs(Δ(t2)) then
    12              δmin = δtmp + abs(Δ(t1)) + abs(Δ(t2));
    13              tmp = (t1, t2);
    14          end
    15      end
    16      if tmp ≠ (null, null) then
    17          todo.add(tmp);             /* both buckets of the best split */
    18      else
    19          bkts.add(b);
    20      end
    21  end
    22  return bkts;

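To model the uneven distribution of $f_1$ we introduce another
parameter $\alpha \in [0, 1]$ and set $f_1^{b_1} = \alpha \cdot f_1$ and $f_1^{b_2} =
(1 - \alpha) \cdot f_1$. As a result, the inequality in Equation 13 becomes:

$$\underbrace{\frac{n \cdot c}{n - f_1}}_{\text{Before split}} \le \underbrace{\frac{\frac{n}{2} \cdot \frac{c}{2}}{\frac{n}{2} - \alpha \cdot f_1} + \frac{\frac{n}{2} \cdot \frac{c}{2}}{\frac{n}{2} - (1 - \alpha) \cdot f_1}}_{\text{After split}}$$ (14)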


Appendix C shows that the right-hand side of the above
inequality has its global minimum at $\alpha = 0.5$, which evaluates
to $nc/(n - f_1)$ ($\hat{N}$ before the split), and that the inequality
always holds. Thus, it can be seen that splitting a data set
into buckets not only potentially increases the error, but it
does so in a monotonic way.
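The behavior of the split inequality can be checked numerically. The sketch below evaluates both sides for a range of $\alpha$ values under the same simplifying assumptions as the text (uniform publicity, even split of $n$ and $c$); the concrete numbers for `n`, `c`, and `f1` are arbitrary illustrative choices.

```python
def before_split(n, c, f1):
    """Chao92 estimate with gamma = 0 on the unsplit data: n*c / (n - f1)."""
    return n * c / (n - f1)

def after_split(n, c, f1, alpha):
    """Sum of the two per-bucket estimates when bucket 1 receives alpha * f1."""
    half_n, half_c = n / 2, c / 2
    left = half_n * half_c / (half_n - alpha * f1)
    right = half_n * half_c / (half_n - (1 - alpha) * f1)
    return left + right

n, c, f1 = 1000, 200, 40                     # illustrative values, n >> c >> f1
alphas = [i / 10 for i in range(11)]
for a in alphas:
    print(f"alpha={a:.1f}  after={after_split(n, c, f1, a):.4f}  "
          f"before={before_split(n, c, f1):.4f}")
```

The two sides coincide at $\alpha = 0.5$, and the post-split sum grows as the singletons are distributed more unevenly, in either direction.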

Yet, this does not mean that the sum estimate $\Delta$ always
increases as well. Especially with a publicity-value
correlation, the overall estimate of $\Delta$ over all buckets can still
decrease as the average attribute values per bucket differ.
This is in line with our original motivation to use buckets,
as we wanted to get a more detailed unknown unknowns
estimate (e.g., how many small companies vs. large companies
are missing). Bringing these two observations together,
we can assume for many real-world use cases that whenever
our estimate of the impact of unknown unknowns $\Delta$
increases after a split, it has a significant chance of being
caused by the increasing error in $\hat{N}$, whereas when it
decreases it potentially improves the estimate due to the more
detailed unknown estimate. While it does not always have
to be the case (e.g., if the publicity-value correlation is
negative), it is still an indicator for many real-world use cases (see
Section 6). Based on these observations, we have devised the
conservative bucket splitting strategy: only split the bucket
if the overall estimate for $\Delta$ is minimized.

The Algorithm: Algorithm 1 shows the final algorithm.
First, we add a bucket which covers the complete value range
of $S$ to the todo list (line 2) and calculate the current $\Delta$
over $S$ (line 3). Note that we take the absolute values of
all estimates ($\Delta$) to underestimate the impact of unknown
unknowns even for the case of having negative attribute
values (e.g., net losses of companies). Afterwards, we check
recursively if we can split the bucket to minimize $\Delta$ until no
further “underestimation” is possible (lines 5-21).

To this end, we remove the first bucket from the todo list
(line 6) and calculate the $\Delta$ over $S$ without the impact of
this bucket $b$ (line 7). Note that during the first iteration
$\delta_{tmp}$ will be 0. Afterwards, for every unique record in $b$, we
split the current bucket $b$ into two temporary buckets $t_1$ and
$t_2$ based on the record's attribute value (line 10). If the
resulting estimate using this split is smaller than the previously
observed minimum (line 11), we set the new minimum to
this value (line 12) and temporarily store the new buckets
(line 13). When the for-loop of lines 9-15 finishes and if at
least one improving split was found (line 16), tmp will contain
the buckets of the split that minimizes $\delta$ for the bucket, and
$\delta_{min}$ the new minimum value of $\delta$. Those buckets are then
added to the todo list (line 17) to be checked if splitting
them again would further lower the estimate. On the other
hand, if tmp is empty, the algorithm was not able to further
split the bucket, and the current bucket without any additional
splits is added to the final bucket list (line 19). If no
buckets are left in the todo list, the algorithm terminates
and bkts contains the final list of buckets.
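The dynamic splitting procedure can be sketched in Python as follows. This is an illustrative reading of Algorithm 1, not a reference implementation: buckets are represented as sorted lists of hypothetical `(value, count)` pairs, the per-bucket estimate is the naive estimator, splits are only tried between distinct values, and buckets consisting solely of singletons are given an infinite estimate so that they are never preferred.

```python
import math
from collections import Counter

def chao92(obs_counts):
    """Chao92 estimate of N from a list of per-item observation counts."""
    n, c = sum(obs_counts), len(obs_counts)
    f = Counter(obs_counts)
    if n == 0:
        return 0.0
    if n == f[1]:
        return math.inf
    cov = 1.0 - f[1] / n
    pair_sum = sum(j * (j - 1) * fj for j, fj in f.items())
    gamma2 = max((c / cov) * pair_sum / (n * (n - 1)) - 1.0, 0.0)
    return c / cov + n * (1.0 - cov) / cov * gamma2

def delta(bucket):
    """Absolute naive estimate for one bucket of (value, count) pairs."""
    c = len(bucket)
    missing = chao92([k for _, k in bucket]) - c
    if math.isinf(missing):
        return math.inf
    return abs(sum(v for v, _ in bucket) / c * missing)

def dynamic_buckets(items):
    b0 = sorted(items)                       # bucket covering the full range
    todo, bkts = [b0], []
    d_min = delta(b0)
    while todo:
        b = todo.pop(0)
        d_tmp = d_min - delta(b)             # estimate without bucket b
        best = None
        for i in range(1, len(b)):
            if b[i][0] == b[i - 1][0]:
                continue                     # split only between distinct values
            t1, t2 = b[:i], b[i:]
            cand = d_tmp + delta(t1) + delta(t2)
            if cand < d_min:                 # keep only strictly improving splits
                d_min, best = cand, (t1, t2)
        if best:
            todo.extend(best)
        else:
            bkts.append(b)
    return bkts, d_min

# Hypothetical (employees, times observed) pairs: large companies are seen often.
items = [(50000, 6), (42000, 5), (38000, 5), (1200, 2),
         (900, 2), (60, 1), (40, 1), (35, 2)]
buckets, total = dynamic_buckets(items)
print(len(buckets), total)
```

On this toy input, the procedure separates the three smallest companies, which include both singletons, from the rest, so the missing companies are valued at the average of the small bucket rather than at the overall mean.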


3.4 Monte-Carlo Estimator 

As our experiments show, the previous estimator actually
performs very well (see Section 6). However, what it does
not consider is the effect of uneven contributions from data
sources (i.e., one data source contains much more data than
another) and the peculiarities of the sampling process itself.
The Chao92 species estimation, like almost all other
estimators, assumes sampling with replacement, whereas our
data sources sample without replacement from the underlying
ground truth. The reason why Chao92 still works
is that with a reasonably high number of data sources the
integrated data source $S$ approximates a sample with
replacement [46]. However, with either a small number of
data sources or uneven contributions from sources (i.e., some
sources are significantly bigger than others), $S$ diverges
significantly from a sample done with replacement, resulting in
significant over- or underestimation. In the case of
crowdsourcing, the workers who provide significantly more data
items than other workers are referred to as streakers [46].


To address these issues, we present a Monte Carlo-based (MC) estimator for $N$. The idea is to simulate the sampling process to find the distribution, together with its population size $N$, that best explains the observed sample, including how many items $s_j$ every data source $j$ contributes. More formally, given $(s_1, \ldots, s_l)$, we seek a set of parameters $\Theta$ (e.g., the distribution parameters) for the MC simulation that minimizes some distance function $\Gamma$ between the observed data $S$ and the simulated data $Q_\Theta$:

$$\arg\min_{\Theta} \Gamma(S, Q_\Theta \mid [s_1, \ldots, s_l]) \qquad (15)$$

In the following we first describe the MC method for generating $Q_\Theta$ with given $\Theta$, then the distance function $\Gamma$, and finally the search strategy to find the optimal parameter $\Theta$.

Algorithm 2: Monte Carlo method

Input : θ_N, θ_λ, S, [n_1, ..., n_l], nbRuns
Output: Average distance

 1  E = dist(θ_N, θ_λ);                /* publicity of N items */
 2  Γ = 0.0;                           /* default value */
 3  for i = 1 to nbRuns do
 4      Q = ∅;                         /* simulated model */
 5      for j = 1 to l do
 6          s_j = sample(n_j, E);      /* w/o repl */
 7          Q.add(s_j);
 8      end
 9      (F_S, F_Q) = indexing(S, Q);
10      F'_S = smooth(F_S, F_Q);
11      Γ += klDiv(F'_S, F_Q);         /* KL-divergence */
12  end
13  return Γ/nbRuns;

3.4.1 Monte-Carlo Method 

In contrast to the other estimators, the Monte-Carlo estimator requires an assumption about the shape of the underlying publicity distribution; in this work, we use an exponential distribution for publicity, from which data source $j$ samples $n_j$ data items. Accordingly, the parameter $\Theta$ has two components: $\theta_N$ specifies the assumed number of data items, and $\theta_\lambda$ governs the shape (skew) of the publicity distribution. Note that the assumption of an exponential distribution makes the MC method a parametric model. The goal of the MC simulation is to determine how well $\theta_N$ and $\theta_\lambda$ help to explain the observed $S$.

Algorithm 2 shows our MC algorithm. First, we use an exponential distribution with skew $\theta_\lambda$ to sample the publicity $(p_1, \ldots, p_N)$ for $\theta_N$ items (line 1). We then initialize the distance to 0 (line 2). Afterwards, we repeat the following procedure nbRuns times. For every data source (line 5), we sample $n_j$ data items according to $E$, without replacement (line 6). The sampled items are added to $Q$ to form a histogram (line 7) for the particular run. After simulating $l$ sources, $Q$ contains the simulated version of $S$.

To finally compare the simulated sample $Q$ with the observed sample $S$, we make use of the discrete KL-divergence metric [23]. However, this requires transforming $S$ and $Q$ into frequency statistics and indexing them to ensure that the right items are compared with each other (line 9).

After the indexing we have two comparable frequency statistics for $S$ and the simulation: $F_S$ and $F_Q$. However, $S$ might contain fewer than $N$ unique data items, in which case the KL-divergence is not defined. We therefore adjust $F_S$ and assign a small non-zero probability to the missing unique items (line 10). Finally, the two frequency statistics are compared using the standard KL-divergence metric and the result is added to the total distance (line 11). After all simulation runs, the average distance is returned (line 13).
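The simulation loop can be sketched in Python as follows. This is an illustrative sketch under several assumptions that the text does not fix: the publicity of item $i$ is taken to be proportional to $\exp(-\theta_\lambda \cdot i / \theta_N)$, indexing is done by frequency rank, and additive smoothing is applied to both histograms (the text only mentions smoothing $F_S$) so that the divergence stays finite when a simulated run contains fewer unique items than $S$. All function names are hypothetical.

```python
import math
import random
from collections import Counter


def publicity(theta_n, theta_lambda):
    """Assumed exponential publicity: p_i proportional to exp(-lambda * i / N)."""
    weights = [math.exp(-theta_lambda * i / theta_n) for i in range(theta_n)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_without_replacement(n_j, probs, rng):
    """Weighted sample of distinct item indices.

    Each item draws an exponential waiting time with rate p_i; the n_j
    smallest waiting times form a weighted sample without replacement.
    """
    keys = sorted((rng.expovariate(p), i) for i, p in enumerate(probs))
    return [i for _, i in keys[:min(n_j, len(probs))]]


def rank_frequencies(counts, length, eps=1e-6):
    """Sort counts in descending order, pad to `length`, smooth and normalize."""
    freq = sorted(counts, reverse=True) + [0] * (length - len(counts))
    total = sum(freq) + eps * length
    return [(f + eps) / total for f in freq]


def kl_divergence(p, q):
    """Discrete KL-divergence between two distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def monte_carlo_distance(theta_n, theta_lambda, observed_counts,
                         source_sizes, nb_runs=50, seed=0):
    """Average KL-divergence between the observed and simulated histograms.

    observed_counts: how often each unique observed item appears in S.
    source_sizes: number of items n_j contributed by each source j.
    """
    rng = random.Random(seed)
    probs = publicity(theta_n, theta_lambda)
    total = 0.0
    for _ in range(nb_runs):
        simulated = Counter()
        for n_j in source_sizes:
            simulated.update(sample_without_replacement(n_j, probs, rng))
        length = max(len(observed_counts), len(simulated))
        f_s = rank_frequencies(observed_counts, length)
        f_q = rank_frequencies(list(simulated.values()), length)
        total += kl_divergence(f_s, f_q)
    return total / nb_runs
```

Fixing the seed makes repeated evaluations of the same $(\theta_N, \theta_\lambda)$ pair reproducible, which helps when the distances are later compared across a parameter grid.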

3.4.2 Search Strategy 

We can now simulate the observed sampling process leading to $S$, but we still need a way to find the optimal $\Theta$, which best explains the observed sample $S$. The difficulty is that, even though the KL-divergence cost function is convex, the integer variable $N$ prevents us from using tractable optimization algorithms (e.g., gradient descent). Furthermore, the distance function can be quite sensitive to small amounts of noise in D.

Algorithm 3: Monte-Carlo based N estimation

Input : [s_1, ..., s_l], c, N_Chao92, nbRuns
Output: Estimated number of unique data items, N

 1  D_KL = [];                                    /* KL-divergence */
 2  n = sizes([s_1, ..., s_l]);
 3  Θ_N = [c : (N_Chao92 - c)/10 : N_Chao92];
 4  Θ_λ = [-0.4 : 0.1 : 0.4];
 5  for θ_N ∈ Θ_N do
 6      for θ_λ ∈ Θ_λ do
 7          Γ = monteCarlo(θ_N, θ_λ, n, nbRuns);  /* Alg 2 */
 8          D_KL.add(Γ);
 9      end
10  end
11  p = curveFit(Θ, D_KL, 2);                     /* 2-D curve fit */
12  [N, λ] = argmin {p(N, λ)} over λ ∈ [-0.4, 0.4], N ∈ [c, N_Chao92];   /* min on the curve */
13  return N;

We therefore make the estimator more robust by first performing a grid search for $\Theta$ (lines 5-10). We vary $\theta_N$ between $c \le N \le N_{Chao92}$ with a step size of $(N_{Chao92} - c)/10$ and $\theta_\lambda$ between $-0.4 \le \lambda \le 0.4$ (i.e., almost no to heavy skew) with a step size of 0.1 (lines 3 and 4). The step sizes are chosen to be small enough to model the convex curve, but large enough to be robust to noise. Afterwards, we fit a two-dimensional curve using least-squares curve fitting (line 11) and return the $N_{MC}$ with the minimum $D_{KL}$ on the fitted curve as the final count estimate (lines 12-13).
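The grid search and curve fit can be sketched as follows. This is a hedged illustration, not the authors' implementation: it assumes the fitted curve is a quadratic surface in $(N, \lambda)$, solves the least-squares normal equations directly, and takes the distance as a callable `distance(n, lam)` (a hypothetical stand-in for the Monte Carlo simulation). Both `c` and `n_chao92` are assumed to be integers.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def features(x, lam):
    """Basis of a quadratic surface in the scaled count x and the skew lam."""
    return [1.0, x, lam, x * x, x * lam, lam * lam]


def estimate_n(distance, c, n_chao92, steps=10):
    """Grid search over (N, lambda), quadratic fit, minimum on the fit."""
    span = n_chao92 - c
    if span < 2:
        raise ValueError("need n_chao92 >= c + 2 for a quadratic fit")
    grid_n = sorted({round(c + span * k / steps) for k in range(steps + 1)})
    grid_lam = [round(-0.4 + 0.1 * k, 1) for k in range(9)]

    rows, targets = [], []
    for n in grid_n:
        for lam in grid_lam:
            rows.append(features((n - c) / span, lam))  # scale N to [0, 1]
            targets.append(distance(n, lam))

    # Least squares via the normal equations (A^T A) w = A^T y.
    k = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    atb = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    coef = solve_linear(ata, atb)

    def fitted(n, lam):
        return sum(w * f for w, f in zip(coef, features((n - c) / span, lam)))

    # Evaluate the fitted surface on integer N and a fine lambda grid.
    candidates = [(fitted(n, step / 100), n, step / 100)
                  for n in range(c, n_chao92 + 1)
                  for step in range(-40, 41)]
    _, best_n, best_lam = min(candidates)
    return best_n, best_lam
```

Taking the minimum on the fitted surface rather than on the raw grid values is what gives the robustness to noise that the paragraph above describes: a single noisy grid cell cannot become the estimate on its own.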

Finally, to estimate the total difference, we use our naive estimation technique with $N_{MC}$. The estimate is more robust and over-estimates less than the original naive estimator, as our MC method always penalizes any unmatched unique items in $Q$. In other words, the MC estimator favors solutions where $N$ is closer to the number of observed unique items $c$.

3.5 Other Estimators 

During the course of developing the above estimators, we explored various alternatives. For example, we experimented with alternative static bucket strategies (see also the appendix). Most importantly though, we noticed that many of the proposed techniques can actually be combined. For instance, we can use the frequency estimator, instead of the naive estimator, with the bucket estimator (i.e., the Dynamic Bucket approach) or the Monte-Carlo estimator. More interestingly, we can also combine the Monte-Carlo estimator with the bucket estimator. However, as the Monte-Carlo estimator requires large sample sizes to be accurate, we found that this combination often decreases the estimation quality. Similarly, we found that the difference between the naive and frequency estimators does not help much for the bucket approach (see the appendix). For the experiments we therefore focus on the original techniques rather than the various combinations, and include the other results in the appendix.

4. ESTIMATION ERROR UPPER BOUND 

In this section, we derive an estimation error upper bound, specifically, the worst case estimation error of the naive estimator (Equation 3). The same upper bound can easily be applied to each bucket in the bucket estimator, as well as to the Monte-Carlo estimator.

To estimate the impact of unknown unknowns on SUM 
query results we multiply the estimate for the number of 
unknown data with the estimate of the values. Hence, we 
define the worst case estimate as the product of the worst 
case unknown data count and the worst case value estimate. 

The Chao92 count estimation is based on sample coverage plus a correction for the skew $\gamma > 0$. Recent work proposed a tight error bound of the Good-Turing estimator for the ground truth unknown unknowns distribution mass ($M_0$):

$$M_0 \le \frac{f_1}{n} + (2\sqrt{2} + \sqrt{3}) \sqrt{\frac{\log 3/\epsilon}{n}} \qquad (16)$$

which holds with probability at least $1 - \epsilon$ over the choice of the sample with $n = |S|$. The confidence parameter $\epsilon$ governs the tightness of this bound (we use $\epsilon = 0.01$ for 99% confidence). Based on Equation 16 we bound Chao92:

$$N_{Chao92} = \frac{c}{C} + \frac{n(1 - C)}{C} \cdot \gamma^2 \approx \frac{c}{C} \approx \frac{c}{1 - M_0} \le \frac{c}{1 - \left(\frac{f_1}{n} + (2\sqrt{2} + \sqrt{3}) \sqrt{\frac{\log 3/\epsilon}{n}}\right)} \qquad (17)$$

Notice that we can omit $\gamma$ as it only makes Chao92 converge faster, but does not influence the asymptotic estimate, which is based on the sample coverage.

As the distribution of the mean substitution ($\phi_K / c$) tends to a normal distribution (Central Limit Theorem), we define the worst case estimate of the ground truth attribute mean value ($\phi_D / N$) with the help of the sample standard deviation ($\sigma_K$):

$$\frac{\phi_D}{N} \le \frac{\phi_K}{c} + z \sigma_K \qquad (18)$$

Here $z$ controls the confidence of the bound, and we use $z = 3$ based on the three-sigma rule of thumb [49] to have nearly all values with 99.95% confidence lie below the upper bound. The final upper bound is then the simple multiplication of the two worst case estimators (we present the results in Section 6.4):

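$$\Delta_{bound} = \left(\frac{c}{1 - \left(\frac{f_1}{n} + (2\sqrt{2} + \sqrt{3}) \sqrt{\frac{\log 3/\epsilon}{n}}\right)} - c\right) \cdot \left(\frac{\phi_K}{c} + z \sigma_K\right) \qquad (19)$$

5. OTHER AGGREGATE QUERIES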


In this section we describe how the same techniques for SUM aggregates can be applied to other aggregates for estimating the impact of the unknown unknowns.

COUNT: Estimating COUNT is easier than SUM as it only requires estimating the number of unknown data items, but not their values. For instance, one could directly use either the Chao92 estimator or the techniques proposed in [46]. In addition, the bucket and Monte-Carlo approaches can be used simply by skipping the second step, i.e., not multiplying the estimated count with the value estimates.
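For the COUNT case, a direct use of the Chao92 estimator can be sketched as follows. The sketch follows the standard coverage-based formulation ($C = 1 - f_1/n$, with the squared coefficient of variation $\gamma^2$ as skew correction); the function name `chao92` and the choice to return the coverage alongside the estimate are assumptions made for illustration.

```python
from collections import Counter


def chao92(item_ids):
    """Return (estimated unique count, sample coverage) for observed item ids.

    item_ids lists every observation, so an item reported by three
    sources appears three times.
    """
    n = len(item_ids)
    if n == 0:
        return 0.0, 0.0
    counts = Counter(item_ids)
    c = len(counts)                          # observed unique items
    freq_of_freq = Counter(counts.values())  # f_i: items seen exactly i times
    f1 = freq_of_freq.get(1, 0)
    coverage = 1.0 - f1 / n
    if coverage <= 0.0:
        return float("inf"), 0.0  # all singletons: the estimate is undefined
    n0 = c / coverage
    pair_sum = sum(i * (i - 1) * f for i, f in freq_of_freq.items())
    gamma_sq = max(n0 * pair_sum / (n * (n - 1)) - 1.0, 0.0)
    n_hat = n0 + n * (1.0 - coverage) / coverage * gamma_sq
    return n_hat, coverage


# Impact of unknown unknowns on COUNT: estimated total minus observed uniques.
answers = ["a", "a", "a", "b", "b", "c", "d"]
n_hat, coverage = chao92(answers)
unknown_count = n_hat - len(set(answers))
```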

AVG: The simplest way to estimate the AVG with unknown unknowns is to use the AVG over the observed sample $S$, which is reasonable because of the law of large numbers. However, $S$ might be biased due to a publicity-value correlation, in which case the estimate needs to be corrected. One way to deal with the bias is to use our bucket approach with a simple modification of how the $\Delta$s per bucket are aggregated (e.g., a weighted average of the per-bucket averages, weighted by the estimated number of unique data items ($\hat{N}_{Chao92}$) per bucket).

MAX/MIN: At first glance, it seems impossible to estimate MIN or MAX in the presence of unknown unknowns. However, we can still do better than simply returning the observed extreme values by reporting when we believe that the observed minimum or maximum value is the true extreme value. This is already very helpful in many integration scenarios and easy to do with our bucket estimator. The strategy divides the observed value range of $S$ into consecutive sub-ranges (i.e., buckets); the number of unknown unknowns as well as their values are estimated per bucket. If the estimated unknown unknowns count in the highest (lowest) value range bucket is zero, then we say that we have observed the true maximum (minimum) value and only then report the highest (lowest) value.
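The reporting rule can be illustrated with a short sketch. It assumes that the buckets and their unknown-count estimates have already been computed, and it treats an estimate below a tolerance of 0.5 as zero; both the input layout and the tolerance are assumptions for illustration.

```python
def report_extremes(buckets, tol=0.5):
    """Report observed MIN/MAX only when the outermost bucket looks complete.

    buckets: list of (observed_values, estimated_unknown_count) pairs,
    ordered from the lowest to the highest value range.
    Returns (minimum, maximum); an entry is None when the estimator
    still expects unseen items in the corresponding bucket.
    """
    low_values, low_unknown = buckets[0]
    high_values, high_unknown = buckets[-1]
    minimum = min(low_values) if low_unknown < tol else None
    maximum = max(high_values) if high_unknown < tol else None
    return minimum, maximum


# The low bucket still expects about three unseen items, the high one none.
result = report_extremes([([10, 20, 30], 3.0), ([900, 1000], 0.0)])
```

Here only the maximum would be reported, since the lowest bucket is still estimated to be incomplete.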

6. EXPERIMENTS 

We evaluated our algorithms on several crowdsourced and synthetic data sets to test their predictive power. Crowdsourcing allowed us to generate many real data sets and avoided the licensing issues which often come with other data sources. We designed our experiments to answer the following questions:

• How does the estimation quality between the different 
estimators compare on real-world data sets? 

• What is the sensitivity of our estimators in regard to 
data skew ( publicity-value correlation) and streakers/im¬ 
balance of data sources? 

• How useful is the upper bound? 

• How early are accurate MIN/MAX estimates possible? 

6.1 Real Crowdsourced Data 

We evaluated the estimation techniques on a number of 
real-world data sets, each gathered independently using Ama¬ 
zon Mechanical Turk, following the guidelines in [13]. Here 
we chose four representative data sets and four aggregate 
queries, which show different characteristics we encountered 
during the evaluation. 

1. US tech revenue & employment: For the query: how much revenue does the US tech industry produce?, i.e., SELECT SUM(revenue) FROM us_tech_companies, we used the crowd to collect US tech company names (see footnote 2) and revenues. Similarly, in an independent experiment we asked for US tech company names and number of employees, in order to answer the question: how many people does the US tech industry employ?, i.e., SELECT SUM(employees) FROM us_tech_companies. We selected the two data sets as they exhibit a steady arrival of unique answers from crowd workers.

2. US GDP: As a proof-of-concept experiment, we asked crowd workers to enter a US state with its GDP. This data set suffered from streakers.

3. Proton beam: Together with researchers from the field of Evidence Based Medicine (EBM) (group-name omitted for double blind reviewing) we created a platform for abstract screening and fact extraction and spent over $6,000 on AMT to screen articles about 4 different topics. Here we utilize the results on one of these, namely Proton beam: a set of articles on the benefits and harms of charged-particle radiation therapy for patients with cancer. Part of the abstract screening asked workers to supply the number of patients being studied. The question we aim to answer is how many people, in total, participated in these types of studies: SELECT SUM(participants) FROM proton_beam_studies. This data set and research question are grounded in a real world problem and, unlike the other queries, this one does not have a known answer.

Footnote 2: We asked for companies in Silicon Valley to get a representative sample of US tech companies; without restrictions we received too many tiny computer shops and even non-US based companies.

Figure 4: The best US tech-sector employment (x-axis: # crowd answers)

We paid between 2 and 35 cents per task. For the Proton beam experiment we designed a qualification test and introduced hidden control tests to filter out bad workers (reference is omitted for double blind reviewing); the other experiments were done without qualification tests. For the purposes of this study, we performed data cleaning manually: if workers disagreed on the value (e.g., the number of employees of a company) we used the average.

In the following we describe the results for every data set and the following estimators: naive (Naive) (Section 3.1), frequency (Freq) (Section 3.2), bucket (Bucket) (Section 3.3), and Monte-Carlo (MC) (Section 3.4). Other estimators did not perform as well or had the same performance and are only shown in the appendix (e.g., Appendix B).

6.1.1 US Tech-Sector Employment 

Figure 4 shows the SUM estimates from the different estimators (colored lines) for our running example SELECT SUM(employees) FROM us_tech_companies as well as the observed SUM (grey line) over time (i.e., with an increasing number of crowd answers). As the ground truth (dotted black line) we used the US tech sector employment report from the Pew Research Center [39].

Both the naive and frequency estimators heavily overesti¬ 
mate the impact of unknown unknowns. The frequency esti¬ 
mator does slightly better than the naive estimator, which 
indicates that some big companies have a high publicity like¬ 
lihood and were observed early on by several sources. 

In contrast, the MC estimator does well until it falls back to the observed query result. This can be explained by a peculiarity of this experiment: after roughly 280 crowd-provided data items, all remaining companies have a rather uniform publicity likelihood. In such a case, the MC estimator has a tendency to favor count estimates that are similar to the number of observed items: $N_{MC} \approx c$. This is a major drawback of our MC estimation technique.

Finally, the bucket estimator provides the best estimate (4053160.57 at 500 crowd answers), which is only ~2.5% above the ground truth (3951730). While it is possible that the bucket estimator might require more data to converge, it is also possible that the ground truth is inaccurate: the employment statistics can vary widely based on many factors (e.g., inclusion of part-time employees, tech sector definition). We also speculate that there exist many smaller US tech start-ups that might be overlooked by survey research agencies, due to the high data collection cost. In contrast, a group of crowd workers can more easily find smaller start-ups and their number of employees on web pages. Thus, the bucket estimate could be closer to the true value than the one by the Pew Research Center. This is an astonishing result as the cost of crowdsourcing (e.g., $50.00 per 500 crowd answers for the US tech revenue & employment experiments) is probably only a small fraction of the cost of survey research by any major agency.

Figure 5: Real data experiments with aggregate SUM query: (a) US Tech Revenue, (b) US GDP, (c) Proton Beam (x-axis: # crowd answers)

6.1.2 US Tech-Sector Revenue 

Figure 5(a) shows the results for the US tech-sector revenue. In this data set, both the naive and the frequency techniques overestimate the ground truth significantly because of the publicity-value correlation. While both estimators will eventually converge to the ground truth, they require significantly more crowd answers than we collected.

Again, both the Monte-Carlo and bucket estimators provide better estimates than the naive and frequency estimators. Yet, Monte-Carlo still overestimates, whereas bucket gives an almost perfect estimate after 240 answers. However, it can also be observed that the bucket estimator slightly overestimates at the end of the experiment. This happens because one crowd worker suddenly reported a few unique smaller companies, causing the estimator to believe that there were more. Again, we cannot say with 100% certainty that our assumed ground truth is actually the real ground truth, and the bucket estimate might or might not be the real value.

6.1.3 GDP per US State 

Figure 5(b) shows the estimate quality for our GDP experiment. To clean the data, we substituted the crowd reported GDP values with the values from [50]. This experiment suffered from streakers, i.e., uneven contributions from crowd workers. A single crowd worker reported almost all answers in the beginning; this kind of aggressive behavior results in an unusually high $f_1$, which throws off the estimators.

As the figure shows, only the Monte-Carlo based technique can actually deal with streakers and provides a reasonable estimate even in the beginning. However, it should also be noted that all estimators converge after 60 samples (for $N = 50$). Furthermore, except for the Monte-Carlo estimator, there is no difference between the other estimators.

6.1.4 Proton Beam 

Finally, results for Proton beam are shown in Figure 5(c). Again the Monte-Carlo estimator follows the observed line, which makes the estimates less interesting. Furthermore, we suspect that the naive and the frequency estimators overestimate with a constantly increasing number of unique data items (reviewed articles). By manually examining the data set, we confirmed that this crowdsourcing experiment did not encounter any streakers, which may cause our estimators (e.g., bucket) to fail. Note that the bucket estimator converges to roughly 95k, which we consider to be the best estimate of the number of participants for this particular type of cancer therapy effectiveness study.

6.1.5 Discussion 

Overall, our bucket estimator has the highest accuracy. The only exception is when streakers are present, in which case the Monte-Carlo estimator performs better. However, it should also be noted that the run-time of the Monte-Carlo estimator is significantly higher than that of the other estimators. While not a serious issue for our experiments (roughly 3.5s for Monte-Carlo vs. 0.2s for bucket), it could be significant for larger data sets, as the run-time scales linearly with sample size (the inner loop in Algorithm 2 depends on the sample size). In the remainder we analyze the different estimators in more depth using simulation and make final recommendations about which estimator to use at the end of the section.

6.2 Synthetic Data Experiment 

To explore the estimation quality more systematically, we used a synthetic data set with $N = 100$ unique items, each having a single attribute value ranging from 10 to 1000 (attr = 10, 20, 30, ..., 1000). We further simulated the sampling process outlined in Section 2 and used an exponential distribution with parameter $\lambda$ to model various publicity distributions ($\lambda = 0$: uniform; $\lambda = 4$: highly skewed). Finally, our simulation allowed us to vary the publicity-value correlation ($\rho = 0$: no correlation; $\rho = 1$: perfect correlation, i.e., the most frequent item also has the largest value).

Figure 6 shows the results for various synthetic data experiments, each of which is repeated 50 times and the results averaged (we omit the error bars for better readability). From left to right, we vary the number of simulated crowd workers (i.e., sources) from $w = 100$ to 10 to 5. From top to bottom, we first assume no publicity skew and no publicity-value correlation ($\lambda = 0, \rho = 0$), which is often an ideal scenario for species estimation techniques; we then show the more realistic scenario with skew and publicity-value correlation ($\lambda = 4, \rho = 1$); and finally we simulate an environment where some rare items might contain high values ($\lambda = 4, \rho = 0$).

Ideal: Looking at the top-left figure with a uniform publicity distribution and a hundred workers, we can see that all estimators perform very well from the beginning. This is not surprising as all estimators work best with sampling with replacement from a uniform publicity distribution; having many workers sampling without replacement from a uniform distribution approximates sampling with replacement. With fewer workers sampling from the uniform distribution (top row), all estimators start to overestimate slightly. We conclude that under the ideal conditions (i.e., the original assumptions of species estimation techniques) all estimators perform equally well.

Figure 6: Synthetic data with varying number of sources ($w$), degrees of publicity skew ($\lambda$) & publicity-value correlation ($\rho$). Panels: w=100, w=10, w=5 (x-axis: Sample Size; series: Observed, Naive, Freq, Bucket, MC, Ground Truth).

Realistic: The middle row shows the scenarios which best resemble real-world use cases, as they consider a skewed publicity distribution with a positive publicity-value correlation. In this case, the bucket estimator always provides the best estimates. However, in contrast to the real-world experiments, the frequency estimator also performs well. This is due to two reasons: first, the publicity is highly skewed and perfectly correlated with the values; second, the item values are evenly spaced. This helps the frequency estimator to under-estimate, as singletons consist only of rare low-valued items from the tail, a peculiarity of this simulation. Also interestingly, with 5 evenly contributing workers almost all estimators perform about the same. However, the bucket estimator has less variance (not shown). We conclude that under the more realistic conditions the bucket estimator performs the best and does not over-estimate the value.

Rare events: Finally, we see in the bottom row that the bucket estimator is not the best choice. This is the case where we have skewed publicity, but no publicity-value correlation. In fact, all estimators perform poorly in this scenario, even with a lot of data sources (d). As the publicity distribution tail can take on any values (i.e., no publicity-value correlation), the tail (i.e., singletons) can contain many high-impact values or “black-swan” events. In this case, because it conservatively favors underestimation, the bucket estimator performs worse. In summary, none of the estimators is able to predict black-swan events or the long tail; all the estimators underestimate the ground truth.

6.3 Streakers 

We have seen in Section 6.1.3 that the estimators can heavily overestimate in the presence of streakers. We now examine the effects of streakers using the synthetic data set with 20 sources, $\lambda = 1.0$ and $\rho = 1.0$.

First, we consider an extreme case where each source successively provides all $N = 100$ data items; first, one data source contributes $n = 100$ items and then the second source starts to contribute its $n = 100$ items, and so on. Figure 7(a) shows that Monte-Carlo simply defaults to the observed sum from one source ($n = 100$), whereas all other estimators fail. This is because all Chao92-based estimators assume a sample with replacement, an assumption which is strongly violated in this case. Only Monte-Carlo is more robust against streakers as it tries to best explain the observed $S$ using simulation.

Next, we consider a more moderate case where we inject a single streaker (i.e., an overly ambitious crowd worker). In Figure 7(b) a streaker is injected at the sample size $n = 160$, contributing all $N = 100$ unique data items directly afterwards. Similar to the previous case, all estimators, except Monte-Carlo, heavily overestimate in the presence of a streaker. Again, the reason is that Monte-Carlo uses simulation to explain the observed sample $S$ instead of assuming that $S$ was created using sampling with replacement.

6.4 Other Queries & Upper Bound 

In this subsection we present results for aggregate queries other than SUM using the techniques from Section 5. As before, we use synthetic data with 100 unique data items (e.g., with values {10, 20, 30, ..., 1000}) integrated over 20 sources with $\lambda = 1.0$ and a publicity-value correlation $\rho = 1.0$. The experiments are repeated 1000 times.

Figure 7: Streaker effect (a-b), estimation upper bound (c), AVG query (d) and aggregate MAX/MIN queries (e)(f) experiments using synthetic data ($\lambda = 1.0$, $\rho = 1.0$: larger values are more likely)

AVG: Figure 7(c) shows the observed (gray line) and estimated (blue line) results for a simple average query of the form SELECT AVG(attr) FROM table. We only show the bucket estimation, as the other estimates exactly overlap the observed AVG query results (i.e., when all unknown unknowns assume the same observed mean value, the AVG query result is the same as the observed). As with the SUM aggregates, our dynamic bucket estimator is able to correct the bias of the average caused by the publicity-value correlation and provides an almost perfect estimate in this scenario.

MIN/MAX: Figure 7(d-e) compactly visualizes the observed MIN or MAX query results. The heat-map shows when the real MIN/MAX value was observed in the data set (the darker the color, the more often the result was observed given a number of samples over the 1000 repetitions). The green line shows, on average, which value was reported if the unknown unknowns count estimate for the highest (MAX) / lowest (MIN) bucket was zero. The text next to the green line shows how often over the 1000 repetitions the MIN/MAX value was reported for a given sample size. As can be seen, the average is almost perfect for both MAX and MIN (note the actual minimum value is 10). That is, whenever our estimation technique for MAX/MIN reports a value, the user can have more trust in it. It should be noted though that it is impossible to estimate rare extreme values (black swans). Thus, it is only possible to improve the confidence in the results, not to eliminate all doubt.

Upper Bound: Finally, in Figure 7(f) we show the upper bound from Section 4 using the same synthetic data set. As can be seen, the bound is very loose (i.e., very large compared to our estimates) and becomes tighter as we observe more data. We observed the same behavior over the real-world data sets (omitted due to space constraints). While the upper bound provides a valuable insight, it may still be too loose for many real-world scenarios and we hope to improve it in the future.
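To see why the bound is loose for small samples, it can be sketched numerically. The sketch assumes the form described in prose in Section 4: the Good-Turing bound on the unseen mass, $f_1/n + (2\sqrt{2} + \sqrt{3})\sqrt{\log(3/\epsilon)/n}$, turned into a worst-case count $c / (1 - \text{bound})$, and a worst-case mean of sample mean plus $z$ sample standard deviations. The function names and the use of the natural logarithm are assumptions.

```python
import math
import statistics


def unseen_mass_bound(f1, n, eps=0.01):
    """Upper bound on the probability mass of unseen items.

    f1: number of singletons, n: sample size, eps: confidence parameter.
    """
    constant = 2 * math.sqrt(2) + math.sqrt(3)
    return f1 / n + constant * math.sqrt(math.log(3 / eps) / n)


def count_bound(c, f1, n, eps=0.01):
    """Worst-case number of unique items; infinite when the bound is vacuous."""
    m0 = unseen_mass_bound(f1, n, eps)
    return math.inf if m0 >= 1.0 else c / (1.0 - m0)


def delta_bound(unique_values, f1, n, eps=0.01, z=3.0):
    """Worst-case unknown count times worst-case mean value.

    unique_values: attribute values of the observed unique items
    (at least two, assumed non-negative here).
    """
    c = len(unique_values)
    worst_mean = statistics.mean(unique_values) + z * statistics.stdev(unique_values)
    return (count_bound(c, f1, n, eps) - c) * worst_mean
```

Under these assumptions and with $\epsilon = 0.01$, the count bound only becomes finite once the sample size exceeds roughly 119 observations, even when no singletons remain, which is consistent with the bound starting out very loose and tightening as more data arrives.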

6.5 Summary 


Which Estimator To Use: While the Monte Carlo or bucket estimators always dominate all the others, there is no clear winner between them. The bucket estimator performs exceptionally well unless the data sources are imbalanced. It provides the best performance on the real-world use cases (except on the GDP experiment, which suffers from streakers); furthermore, it performs at least as well as the other estimators in the simulations from Section 6.2 (except for the rare event case, in which all estimators fail to predict black-swan events). However, when the data sources are imbalanced the Monte Carlo estimator wins.

The reason is that the bucket estimator is a sample coverage-based method, as it uses Chao92, and thus a nonparametric model, which does not require assumptions about the underlying distribution. However, it assumes a single sample with replacement. This assumption is not an issue as long as enough independent data sources exist (using simulations we found that 5 sources are often sufficient, see the appendix) and every data source contributes evenly to $S$ (i.e., there are no streakers).

In contrast, the Monte-Carlo estimator is a form of data-analytic method and really good at adjusting to the specifically observed sampling scenario (i.e., streakers), but at the cost of being a parametric model. The method assumes an exponential distribution to model the publicity distribution, which can be good or bad depending on the true shape of the underlying distribution. Thus, our recommendation is to use the bucket estimator when the analyst knows that enough data sources contribute evenly to the sample and, otherwise, to use the more conservative Monte Carlo method.

While theoretically the bucket estimator should be fairly accurate early on, the authors of [7] found that the Chao92 estimator is inaccurate with very low sample coverage $C$ (i.e., observed items are mostly singletons) and reported results for cases with $C > 0.395$ only. Based on that result, we make the general recommendation to use the estimates only if the predicted sample coverage $C$ (Equation 4) is greater than 40%.

Trust In The Results: With any type of estimator,
the main question arises: How can we trust the estimate?
In 1953, Good, who worked with Turing on the estimators,
already pointed out that “I don’t believe it is usually possible
to estimate the number of species ... but only an appropriate
lower bound to that number. This is because there is nearly
always a good chance that there are a very large number
of extremely rare species” [5]. In estimating the impact of
unknown unknowns, this statement is even more critical, as
the rare items can have extreme values.

Yet despite these obvious risks and assumptions, species
estimation techniques are extensively used in biology and even
helped to decipher the Enigma machine [14]. We actually
believe that it comes down to a simple question: What do
you trust more? A potentially wrong answer because no missing
data is considered, or a potentially wrongly corrected result?
Knowing that, with enough sources and no imbalance among
them, our bucket estimator tends to under- rather than
over-estimate, it can generally be said that it can only improve
the answer (see the simulations and real-world experiments).
With an imbalance of sources and/or only a few data sources,
the answer is less clear, as the estimators, even the
conservative Monte-Carlo technique, over-estimate more often
(e.g., see Figure 5(b)). Thus, the true answer probably lies
somewhere in between. With the help of our upper bound, we can
give the user at least a value range and an idea of where the
true value might be. It should be noted, though, that the
upper bound also requires two new assumptions: an item
probability of at least 1 - e and that the value mean follows
a normal distribution, which in some rare cases might be
violated. Still, we believe that knowing something is wrong,
together with a best guess of where the true value might be,
is better than staying blind. In this work, we made a first
step in this direction, while a lot remains to be done, from
developing tighter bounds and better ways to deal with the
imbalance of sources to easier ways to convey the meaning
(and assumptions) of the estimates to the user.

7. RELATED WORK 

Traditional query processing assumes the database to be
complete (i.e., closed world assumption). Furthermore, nearly
all sampling-based query processing techniques assume
knowledge of the population size [18]; hence, none of these are
suitable for our problem with unknown unknowns. To the
best of our knowledge, this is the first work on estimating
the impact of the unknown unknowns on query results (i.e.,
aggregate query processing in the open world).

Species estimation: Most related to this work are the
various species estimation techniques, like Chao92 [7, 5, 3].
Recent work [48] in this area even tries to estimate the shape
of the population (e.g., support size, N). We could use these
techniques in place of Chao92 to estimate the number of
unknown unknowns, but not to directly estimate the impact
of unknown unknowns, as the shape does not concern the
values of unknown unknowns.

Species estimation techniques have also been used to
estimate the size of search engine indexes and the deep web
[25]. The problem is similar to our unknown unknowns count
estimation, and the most common technique (i.e., capture-
recapture) is also based on the species estimation techniques
[26]. However, they again do not consider the value of the
unknown unknowns.

Species estimation techniques have also been used in the
context of distinct value estimation for a database table
[18, 8]. However, those techniques leverage the knowledge of
the table size to avoid over-estimation.

Survey Methodology & Missing Data: There is a
vast body of literature on sampling-based statistical
inferences to estimate population statistics [45, 32, 20] or
techniques to deal with missingness of values [43, 1].
However, unknown unknowns are different from missing
data; missingness refers to the case when the record is
known, but one (or more) of the values/attributes is missing.
In addition, most of the techniques assume knowledge of the
population size to categorize something as missing (e.g., a
registered subject participates and leaves before the study
completes, a subject deliberately returns an empty
questionnaire, only this many subjects out of that many people
responded, etc.) and, to some extent, knowledge of the cause of
missingness (e.g., missing completely at random, missing at
random, missing not at random) to select appropriate
techniques. Moreover, the statistical inference techniques
(e.g., multiple imputation based EM/maximum likelihood
estimation [1, 10], propensity score estimation [§], or Markov
Chain Monte Carlo simulation [1, 52]) used to fill in the
missing variables require the known non-missing attributes of
the record with missing values to be able to use an inference
model. In the case of unknown unknowns, these assumptions are
violated, as the entire record (i.e., all attributes) is missing.

Missing data is also well studied in databases [40, 37, 21];
however, as traditional RDBMS query processing functions
under the closed world assumption, these works do not consider
unknown unknowns as part of the query processing and largely
treat them as a data cleaning aspect.

Recent works [21, 41] defined database completeness in
a partly open world semantic (i.e., the database can be
incomplete, which causes incorrect query results) and use the
completeness information to denote the completeness of query
results. Similar in spirit to our work, they investigate the 
impact on query results of entire database records that may 
be missing [41]; however, they also assume the knowledge of
population size (e.g., there are 7 days in a week, there are 
this many cities in France) to define the completely missing 
records and measure the completeness. 

Sampling-Based Query Processing: To cope with 
aggregates over large data sets, sampling based estimation 
techniques have been proposed as part of query processing 
36, EH [42] . One limiting aspect of any sampling based esti¬ 
mation techniques, though, is that they assume a complete 
database (i.e., closed world). 


8. CONCLUSION 

Integrating various data sources into a unified data set is
one of the most fundamental tools to achieve high-quality
answers. However, even with the best data integration
techniques, some relevant data might be missing from the
integrated data set. In this work, we have developed techniques
to quantify the impact of any such missing data on simple
aggregate query results. The challenge lies in the fact that
the existence and the value of the missing data are unknown.
To our knowledge, this is the first work on estimating the
impact of unknown unknowns on query results.

By nature, our techniques cannot predict black swan events
(i.e., extremely rare data items) due to a heavily skewed
publicity distribution. However, based on our evaluation
results, we believe that the proposed techniques can provide
valuable insights for users; rather than blindly believing the
closed-world query result, the user gets an idea of what the
impact of unknown unknowns might be.

There are several interesting future directions. Currently,
none of our estimators provides the best performance under
all circumstances. The Monte-Carlo estimator is very robust
against streakers, whereas the bucket estimator provides the
most accurate results if no streakers are present. How to
develop an estimator that is robust in all scenarios remains
an important area for future work. Similarly, developing a
tighter upper bound for aggregate queries would be of great
value. Finally, extending the proposed techniques to more
complex aggregate queries (e.g., with joins) also remains open
for future work.

This work is an important step towards providing higher 
quality query results. After all, we live in a big data world 
where even an integrated data set over multiple sources is 
possibly incomplete. 
APPENDIX

A. SYMBOL TABLE 


Symbol      Meaning

$\Omega$    Universe of all valid entities (unknown size)
$r$         A valid unique entity or data item
$D$         Ground truth or the underlying population
$S$         Observed sample of size $n = |S|$, with duplicates
$K$         Integrated database with only unique entities from $S$
$U$         Unknown unknowns that exist in $D$, but not in $S$ or $K$
$M_0$       Unknown unknowns distribution mass in $D$
$c$         The number of unique data items in $S$; $c = |K|$
$s_j$       Source $j$ with $n_j = |s_j|$ data items
$N$         The size of the ground truth; $N = |D|$
$\phi$      The aggregated query result, e.g., $\phi_D$ (over $D$)
$\Delta$    The impact of unknown unknowns: $\Delta = \phi_D - \phi_K$
$f_j$       A frequency statistic, i.e., the number of data items
            with exactly $j$ occurrences in $S$
$F$         The set of frequency statistics, $\{f_1, f_2, \ldots, f_n\}$
$\rho$      The correlation between publicity and value
            distributions, i.e., publicity-value correlation
$\gamma$    Coefficient of variance (data skew measure)
$C$         Sample coverage, also $C = 1 - M_0$

Table 1: Symbols


B. STATIC BUCKET BASED ESTIMATOR 

In Section 3.3.1, we state that the optimal number of
buckets depends on the underlying publicity distribution.
Here, we elaborate on this with two examples.



[Figure 8 plot omitted; x-axis: # crowd answers]

Figure 8: The best US tech-sector employment estimation
with static buckets. Splitting into more buckets improves
estimation. Eq-width (6-bkt, 10-bkt) results are missing
because some of the buckets are empty.

Figure 8 shows the US tech-sector employment estimates
by various estimators: Naive (1-bucket), Bucket (a.k.a.,
Dynamic Bucket), and Static Bucket (Eq-width and Eq-
height). In this particular example, splitting into more
buckets improves estimation, as the underlying publicity
distribution is skewed and correlated to the values (i.e.,
larger companies are more well known).


[Figure 9 plot omitted; panel title: Uniform publicity,
uncorrelated to values]

Figure 9: Sum(1:10:1000) estimation with static buckets.
Splitting less (e.g., Naive) improves estimation. Data points
are missing when some buckets contain singletons only (i.e.,
infinite estimation).

In contrast, in the simulated case in Figure 9, splitting
into fewer buckets (e.g., Naive) improves estimation, as the
underlying publicity is uniform. Notice that in both examples
above, the bucket estimator yields the best estimates,
dynamically resizing buckets on its own.

Also notice that we consider two variants of static
buckets: the one described in the paper, equi-width, which
divides the observed value range into a fixed number of
buckets, and another obvious variant, equi-height, which
divides the observed sample, sorted by value, evenly into a
fixed number of buckets. Both static bucket types are
simple to use, but they require parameter tuning for the
optimal number of buckets, which is hard to predict without
knowing the true publicity distribution.
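
The two static variants can be sketched as follows. This is an
illustrative sketch only: the function names and the example
values are hypothetical, and the per-bucket estimation step is
left out. It shows how equi-width bucketing can leave buckets
empty on skewed values, while equi-height bucketing cannot.

```python
def equi_width_buckets(values, k):
    """Split the observed value range [min, max] into k equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    buckets = [[] for _ in range(k)]
    for v in values:
        if width == 0:
            idx = 0  # all values identical: everything lands in the first bucket
        else:
            # The maximum value is clamped into the last bucket.
            idx = min(int((v - lo) / width), k - 1)
        buckets[idx].append(v)
    return buckets


def equi_height_buckets(values, k):
    """Sort by value and split the sample evenly into k buckets."""
    ordered = sorted(values)
    n = len(ordered)
    # Bucket i receives the items at positions [i*n//k, (i+1)*n//k).
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]


# Hypothetical skewed values (e.g., employee counts).
values = [300, 900, 1000, 2000, 10000, 12000]
width_buckets = equi_width_buckets(values, 3)
height_buckets = equi_height_buckets(values, 3)
print(width_buckets)
print(height_buckets)
```

With these values the middle equi-width bucket is empty, which is
the situation that leaves data points missing for Eq-width in
Figure 8.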
C. THE INCREASE IN COUNT ESTIMATE
AFTER BUCKET SPLIT 

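In Equation 14, we claimed that the count estimation
($\hat{N}_{Chao92} = nc/(n - f_1)$) of a bucket increases after
splitting the bucket, if data items are evenly distributed over
the attribute value range and there is no publicity-value
correlation:

\[
\underbrace{\hat{N}_{Chao92} = \frac{c}{1 - f_1/n} = \frac{nc}{n - f_1}}_{\text{Before split}}
\;\le\;
\underbrace{\frac{\frac{n}{2} \cdot \frac{c}{2}}{\frac{n}{2} - \alpha \cdot f_1}
+ \frac{\frac{n}{2} \cdot \frac{c}{2}}{\frac{n}{2} - (1 - \alpha) \cdot f_1}}_{\text{After split}}
\]

The $\alpha$ parameter governs the split of the original
singleton count ($f_1$) into a pair of smaller buckets. We
assume $n$ and $c$ are evenly distributed between the split
buckets, as items are evenly distributed over the value range,
and all values are equally likely (no value-publicity
correlation). We now show that the above inequality holds by
showing that the minimum of the right-hand side (after split)
is $nc/(n - f_1)$. Note that $nc/(n - f_1)$ is a positive
number, as $n > f_1 > 0$ and $c > 0$.

To find the minimum, we take the first derivative of the
right-hand side (denoted by $\mathcal{R}$) with respect to
$\alpha$:

\[
\mathcal{R}' = \frac{-c \cdot f_1 \cdot n}{4\left(-(1 - \alpha) \cdot f_1 + \frac{n}{2}\right)^2}
+ \frac{c \cdot f_1 \cdot n}{4\left(-\alpha \cdot f_1 + \frac{n}{2}\right)^2}
\]

Solving $\mathcal{R}' = 0$, we get $\alpha = 0.5$; we have
$\mathcal{R}(0.5) = nc/(n - f_1)$, as shown below:

\[
\mathcal{R}(0.5) = \frac{\frac{n}{2} \cdot \frac{c}{2} + \frac{n}{2} \cdot \frac{c}{2}}{\frac{n}{2} - 0.5 \cdot f_1}
= \frac{n \cdot c}{n - f_1}
\]

Finally, we show that $\mathcal{R}(0.5) = nc/(n - f_1)$ is the
minimum by ensuring $\mathcal{R}''(0.5) > 0$:

\[
\mathcal{R}'' = \frac{c \cdot f_1^2 \cdot n}{2\left(-(1 - \alpha) \cdot f_1 + \frac{n}{2}\right)^3}
+ \frac{c \cdot f_1^2 \cdot n}{2\left(-\alpha \cdot f_1 + \frac{n}{2}\right)^3}
\]

\[
\mathcal{R}''(0.5) = \frac{c \cdot f_1^2 \cdot n}{2\left(-0.5 \cdot f_1 + \frac{n}{2}\right)^3}
+ \frac{c \cdot f_1^2 \cdot n}{2\left(-0.5 \cdot f_1 + \frac{n}{2}\right)^3}
= \frac{c \cdot f_1^2 \cdot n}{\left(-0.5 \cdot f_1 + \frac{n}{2}\right)^3}
= \frac{8 c \cdot f_1^2 \cdot n}{(-f_1 + n)^3}
\]

Note that $n > f_1$, and this makes $\mathcal{R}''(0.5) > 0$;
the minimum of $\mathcal{R}$ is $nc/(n - f_1)$, attained at
$\alpha = 0.5$, and the inequality holds true:

\[
\underbrace{\frac{nc}{n - f_1}}_{\text{Before split}}
\;\le\;
\underbrace{\frac{\frac{n}{2} \cdot \frac{c}{2}}{\frac{n}{2} - \alpha \cdot f_1}
+ \frac{\frac{n}{2} \cdot \frac{c}{2}}{\frac{n}{2} - (1 - \alpha) \cdot f_1}}_{\text{After split}}
\]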
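
The inequality can also be checked numerically. The sketch below
evaluates the after-split sum on a grid of $\alpha$ values for one
hypothetical bucket (n = 100, c = 40, f1 = 20, chosen so that
n/2 > f1 and both denominators stay positive for every $\alpha$ in
[0, 1]); the function names are illustrative.

```python
def chao92_no_cv(n, c, f1):
    """Count estimate before the split: n*c / (n - f1)."""
    return n * c / (n - f1)


def after_split(n, c, f1, alpha):
    """Sum of the two count estimates after splitting the bucket in half.

    n and c are divided evenly; the singletons are divided alpha : (1 - alpha).
    """
    half_n, half_c = n / 2, c / 2
    left = half_n * half_c / (half_n - alpha * f1)
    right = half_n * half_c / (half_n - (1 - alpha) * f1)
    return left + right


# Hypothetical bucket statistics with n > f1 > 0 and n/2 > f1.
n, c, f1 = 100, 40, 20
before = chao92_no_cv(n, c, f1)

alphas = [i / 100 for i in range(101)]
totals = [after_split(n, c, f1, a) for a in alphas]
best_alpha = alphas[totals.index(min(totals))]

print(before, min(totals), best_alpha)
```

On this grid the after-split sum never drops below the before-split
estimate and touches it only at the even split of the singletons.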


D. OTHER ESTIMATORS 

Many proposed techniques can be combined: we can use 
the frequency estimator, instead of the naive estimator, with 
the bucket (i.e., Dynamic Bucket approach) estimator or the 
Monte-Carlo estimator. We can also combine the Monte- 
Carlo estimator with the bucket estimator. 



[Figure 10 plot omitted; x-axis: # crowd answers]

Figure 10: The best US tech-sector employment estimation
with other estimators

However, as the Monte-Carlo estimator requires large
sample sizes to be accurate, combining it with the bucket
estimator often results in lower estimation quality (i.e.,
each bucket contains a smaller sample). Furthermore, each
bucket (a smaller value range) entails a part of the
underlying publicity distribution; hence, the publicity
distribution per bucket appears more uniform. As a major
drawback, the Monte-Carlo estimator exhibits a tendency to
favor its count estimate $\hat{N}_{MC} \approx c$ (see
Section 6.1.1). This tendency becomes more pronounced in the
Monte-Carlo with Bucket estimator, as seen in Figure 10.
Similarly, we found that the difference between the naive and
frequency estimators is not significant for the bucket
estimator (i.e., uniform publicity).

E. NUMBER OF SOURCES 

The bucket estimator is non-parametric and works well
with both uniform and skewed distributions; however, it
assumes a sample S sampled with replacement. This
assumption is appropriate as long as enough independent data
sources contribute evenly to S.




[Figure 11 plots omitted; panels: (a) w = 2, (b) w = 3,
(c) w = 4, (d) w = 5]

Figure 11: Synthetic data ($\lambda = 4.0$, $\rho = 1.0$) with
varying number of sources (w). The bucket estimator performs
better with more independent sources (i.e., more overlaps).


In Figure 11, we illustrate this with a synthetic data set
(skewed publicity correlated to item attribute values). In this
particular example, more than 5 sources result in enough
overlaps for the bucket estimator to estimate accurately;
however, the minimum number of sources would vary with the
data set. In addition, the Monte-Carlo estimator converges
faster, as it does not assume a sample sampled with
replacement.
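
The effect of the number of sources on the overlap can be explored
with a small simulation. The sketch below is not the setup behind
Figure 11: the population size, the source size, the uniform
publicity weights, and all function names are hypothetical. Each
source samples distinct items without replacement, the sources are
concatenated into S, and the count is estimated with the standard
Chao92 form $c/C + n(1 - C)/C \cdot \gamma^2$.

```python
import random
from collections import Counter


def chao92_count(sample):
    """Chao92 count estimate; returns None when every item is a singleton."""
    n = len(sample)
    freq = Counter(sample)
    c = len(freq)
    f1 = sum(1 for k in freq.values() if k == 1)
    if n == f1:
        return None  # estimated coverage is zero, so the estimate is undefined
    coverage = 1 - f1 / n
    pair_sum = sum(k * (k - 1) for k in freq.values())
    gamma2 = max((c / coverage) * pair_sum / (n * (n - 1)) - 1, 0.0)
    return c / coverage + n * (1 - coverage) / coverage * gamma2


def draw_source(population, weights, size, rng):
    """One source: `size` distinct items, drawn without replacement by publicity."""
    items, w = list(population), list(weights)
    chosen = []
    for _ in range(size):
        pick = rng.choices(range(len(items)), weights=w)[0]
        chosen.append(items.pop(pick))
        w.pop(pick)
    return chosen


rng = random.Random(0)
population = list(range(100))        # hypothetical ground truth, N = 100
weights = [1.0] * len(population)    # hypothetical uniform publicity

for w in (2, 3, 4, 5):
    sample = [x for _ in range(w) for x in draw_source(population, weights, 30, rng)]
    print(w, len(set(sample)), chao92_count(sample))
```

Replacing the uniform weights with skewed ones, or letting one
source contribute far more items than the others, gives a quick way
to see how streakers and skew change the estimate.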


F. A TOY EXAMPLE 

In this section, we walk through the different estimators
step by step using a simple toy example. Again, we use
the same query, SELECT SUM(employee) FROM K, from the
introduction, but over a very simplistic data set, shown in
Figure 12. It should be noted that this toy example cannot
convey any statistical properties because of its small
size, but we can explain the general reasoning behind the
techniques using the example.

Figure 12 shows the data integration scenario of our
example. We assume that the ground truth D consists of
5 companies {A, B, C, D, E} (the bubble on the top), with
different numbers of employees (e.g., company A has 1000,
whereas company B has 2000). In the beginning, we have
four data sources $\{s_1, s_2, s_3, s_4\}$, each mentioning
some of these companies; thus, they sample without replacement
from
(a) Multiple sources $s_i$ sampled without replacement from the
unknown population D. $s_5$ is added later to the original
integrated database.
(b) Integrated database K, before (top) and after (bottom)
adding $s_5$.

Figure 12: A toy example for SELECT SUM(employee) FROM K



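Statistics of K
  before adding $s_5$: $n = 7$, $c = 3$, $f_1 = 1$, $\gamma^2 = 0.1667$
  after adding $s_5$:  $n = 9$, $c = 4$, $f_1 = 1$, $\gamma^2 = 0$

Ground Truth
  $\phi_D = 1000 + 2000 + 900 + 10000 + 300 = 14200$

Observed
  before: $\phi_K = 1000 + 2000 + 10000 = 13000$
  after:  $\phi_K = 1000 + 2000 + 10000 + 300 = 13300$

Naive
  $\phi_K + \Delta_{naive} = \phi_K + \frac{\phi_K \cdot f_1 \cdot (c + \gamma^2 n)}{c \cdot (n - f_1)}$
  before: $= 13000 + \frac{13000 \cdot 1 \cdot (3 + 0.1667 \cdot 7)}{3 \cdot (7 - 1)} \approx 16009$
  after:  $= 13300 + \frac{13300 \cdot 1 \cdot (4 + 0 \cdot 9)}{4 \cdot (9 - 1)} \approx 14962$

Freq
  $\phi_K + \Delta_{freq} = \phi_K + \frac{\phi_{f_1} (c + \gamma^2 n)}{n - f_1}$
  before: $= 13000 + \frac{1000 (3 + 0.1667 \cdot 7)}{7 - 1} \approx 13694$
  after:  $= 13300 + \frac{300 (4 + 0 \cdot 9)}{9 - 1} = 13450$

Bucket
  before:
    $\phi_K + \Delta_{bucket} = \phi_K + \Delta_{b_1:\{A,B\}} + \Delta_{b_2:\{D\}}$
    $= \phi_K + (\Delta_{naive})_{b_1} + (\Delta_{naive})_{b_2}$
    $= 13000 + \frac{3000 \cdot 1 \cdot (2 + 0 \cdot 3)}{2 \cdot (3 - 1)} + \frac{10000 \cdot 0 \cdot (1 + 0 \cdot 4)}{1 \cdot (4 - 0)}$
    $= 14500$
  after:
    $\phi_K + \Delta_{bucket} = \phi_K + \Delta_{b_1:\{A,E\}} + \Delta_{b_2:\{B\}} + \Delta_{b_3:\{D\}}$
    $= \phi_K + (\Delta_{naive})_{b_1} + (\Delta_{naive})_{b_2} + (\Delta_{naive})_{b_3}$
    $= 13300 + \frac{1300 \cdot 1 \cdot (2 + 0 \cdot 3)}{2 \cdot (3 - 1)} + \frac{2000 \cdot 0 \cdot (1 + 0 \cdot 2)}{1 \cdot (2 - 0)} + \frac{10000 \cdot 0 \cdot (1 + 0 \cdot 4)}{1 \cdot (4 - 0)}$
    $= 13950$

Table 2: SELECT SUM(employee) FROM K results with different
unknown unknowns estimators: the bucket estimator gives the most
accurate estimation of $\phi_D$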


D. For instance, data source $s_1$ lists companies A, B, and
D. In the example, we also assume a publicity-value
correlation; that is, the biggest company D appears in all data
sources ($\{s_1, s_2, s_3, s_4\}$), while smaller companies
appear in fewer sources. To show how the estimates improve, we
assume that the data source $s_5$ is added later on (visualized
through the plus). The tables in Figure 12(b) show the
integrated database before (top) and after (bottom) adding the
fifth data source. For convenience, the last column shows
how many times each company was observed across the
multiple data sources.

Table 2 shows the estimates by different estimators before
and after adding the fifth data source. We exclude the Monte-
Carlo estimator due to its simulation-based nature. The top
row contains the relevant statistics of K. For instance, with
4 data sources, the number of observed items (sample size)
is $n = 7$, the number of observed unique items is $c = 3$
(i.e., companies A, B, and D from the top table in
Figure 12(b)), the number of singletons is $f_1 = 1$ (i.e.,
company A, as it is the only company that was observed exactly
once across the data sources), and the coefficient of variance
(CV), $\gamma^2 = 0.1667$, is calculated over the sample.
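
These statistics can be recomputed from the occurrence counts. The
sketch below assumes the standard Chao92 estimate of the squared CV,
$\gamma^2 = \max\{(c/C) \sum_j j(j-1) f_j / (n(n-1)) - 1, 0\}$ with
$C = 1 - f_1/n$, and occurrence counts inferred from the bucket
calculations in Table 2 (before: A once, B twice, D four times;
the fifth source adds one occurrence of A and one of E). The
function name is hypothetical.

```python
from collections import Counter


def frequency_statistics(sample):
    """Return (n, c, f1, gamma^2) for a sample S given as a list of item ids."""
    n = len(sample)
    occurrences = Counter(sample)
    c = len(occurrences)
    # f[j] = number of items observed exactly j times
    f = Counter(occurrences.values())
    f1 = f.get(1, 0)
    if n == f1:
        raise ValueError("coverage is zero: every item is a singleton")
    coverage = 1 - f1 / n
    pair_sum = sum(j * (j - 1) * fj for j, fj in f.items())
    gamma2 = max((c / coverage) * pair_sum / (n * (n - 1)) - 1, 0.0)
    return n, c, f1, gamma2


# Occurrence counts inferred from Table 2 (an assumption, see the lead-in).
before = ["A"] + ["B"] * 2 + ["D"] * 4
after = before + ["A", "E"]

print(frequency_statistics(before))
print(frequency_statistics(after))
```

The first call gives n = 7, c = 3, f1 = 1 and a squared CV of about
0.1667; the second gives n = 9, c = 4, f1 = 1 and a squared CV of
(numerically) zero, which are the values used in the Table 2
calculations.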

Before adding the fifth data source, the observed total
sum is $\phi_K = 1000 + 2000 + 10000 = 13000$; after adding
the fifth data source, $\phi_K = 1000 + 2000 + 10000 + 300 =
13300$. In this example, the observed total sum does not
converge to the ground truth of 14200.

Table 2 shows the values with calculations for the
different estimators. As can be seen, the naive estimator
performs the worst; the estimator is quite far off, especially
with 4 data sources. The reason is the value estimator (mean
substitution) used. The average number of employees is
$\phi_K / 3 \approx 4333$. Thus, all missing companies (i.e.,
unknown unknowns) are also assumed to be that big. Since
bigger companies are more likely to be sampled, the naive
estimator heavily over-estimates.

In contrast, the frequency estimator performs much better
than the naive estimator because it assumes that the missing
companies have the average value over singletons, which
includes A, but not the extremely big company D; the missing
companies are assumed to have a value of
$\phi_{f_1} / f_1 = 1000$. Because less popular companies are
more likely to be smaller (i.e., the publicity-value
correlation), this yields a much better estimate.
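
The naive and frequency rows of Table 2 can be reproduced with a few
lines. This is a sketch of the two closed-form expressions shown in
the table; the function and variable names are hypothetical, and the
statistics are the ones used in the table's calculations.

```python
def delta_naive(phi_k, n, c, f1, gamma2):
    """Naive impact estimate: mean value of observed items times missing count."""
    return phi_k * f1 * (c + gamma2 * n) / (c * (n - f1))


def delta_freq(phi_f1, n, c, f1, gamma2):
    """Frequency impact estimate: uses the summed value of the singletons."""
    return phi_f1 * (c + gamma2 * n) / (n - f1)


# Statistics as used in the Table 2 calculations.
before = dict(n=7, c=3, f1=1, gamma2=0.1667)
after = dict(n=9, c=4, f1=1, gamma2=0.0)

naive_before = 13000 + delta_naive(13000, **before)
naive_after = 13300 + delta_naive(13300, **after)
freq_before = 13000 + delta_freq(1000, **before)   # singleton A: 1000 employees
freq_after = 13300 + delta_freq(300, **after)      # singleton E: 300 employees

print(naive_before, naive_after, freq_before, freq_after)
```

Against the ground truth of 14200, the naive estimate overshoots in
both cases, while the frequency estimate stays closer because it
extrapolates from the singleton value rather than the overall mean.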

Finally, the bucket estimator performs the best. Before
adding the fifth source, the algorithm creates two buckets:
$b_1 : \{A, B\}$ and $b_2 : \{D\}$. The estimate quality of the
bucket estimator persists even after we add $s_5$ (i.e., Bucket
is still the best). In this case, the bucket estimator generates
$b_1 : \{A, E\}$, $b_2 : \{B\}$, and $b_3 : \{D\}$. The bucket
estimator automatically groups the small companies (A and E)
together and uses their average number of employees for the
missing companies (all other buckets have an unknown count
estimation of 0); in this example, the bucket estimator has a
smoothed value in between 300 and 1000. This is more desirable
than the behavior of the frequency estimator: E is the
new one and only singleton and $\phi_{f_1}$ is now 300.
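
The bucket row of Table 2 applies the naive expression per bucket
and adds up the results. The sketch below takes the buckets from the
text as given; it does not reproduce the dynamic splitting
algorithm, and the names are hypothetical.

```python
def delta_naive(phi, n, c, f1, gamma2=0.0):
    """Naive impact estimate for one bucket (gamma^2 is 0 in every bucket here)."""
    return phi * f1 * (c + gamma2 * n) / (c * (n - f1))


# Per bucket: (sum of observed values, n, c, f1), as used in Table 2.
buckets_before = {
    "b1:{A,B}": (3000, 3, 2, 1),
    "b2:{D}": (10000, 4, 1, 0),
}
buckets_after = {
    "b1:{A,E}": (1300, 3, 2, 1),
    "b2:{B}": (2000, 2, 1, 0),
    "b3:{D}": (10000, 4, 1, 0),
}


def bucket_estimate(phi_k, buckets):
    """Observed sum plus the naive impact estimate of every bucket."""
    return phi_k + sum(delta_naive(*stats) for stats in buckets.values())


estimate_before = bucket_estimate(13000, buckets_before)
estimate_after = bucket_estimate(13300, buckets_after)
print(estimate_before, estimate_after)
```

After adding the fifth source, only bucket $b_1$ contributes: it
predicts one missing company valued at the bucket mean of
1300 / 2 = 650, which is the smoothed value between 300 and 1000
mentioned above.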



